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The  accurate  estimation  and  tracking  of  parameters  of  a time- 

varying  signal  are  of  importance  in  applications  of  signal  processing, 
such  as  geophysics,  seismology,  communications,  and  speech.  The 
research  results  developed  in  this  dissertation  bear  on  this  problem. 
A weighted  recursive  least  squares  algorithm  with  a variable  forgetting 
factor  (WRLS-VFF)  was  developed  to  estimate  the  parameters  of  an  auto- 
regressive and  moving  average  (ARMA)  process.  The  variable  forgetting 
factor,  which  controls  the  memory  used  by  the  estimation  algorithm,  can 
be  recursively  updated  during  the  adaptive  process.  Based  on  this 

variable  forgetting  factor,  a method  to  estimate  the  input  excitation 
signal  to  the  ARMA  process  was  also  proposed.  Experimental  results 
based  on  the  analysis  of  synthetic  signals  showed  that  the  WRLS-VFF 
algorithm  performed  better  than  block  processing  techniques  and  other 
recursive  methods  in  terms  of  estimation  accuracy  and  tracking  ability. 

The  WRLS-VFF  algorithm  was  applied  to  analyze  speech  signals. 

Two  analysis  techniques  were  developed  to  extract  vocal  tract 
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parameters  (formants)  and  excitation  source  (glottal  volume-velocity 
waveform)  from  speech  and  electroglot tographic  (EGG)  signals.  These 
techniques  are  (1)  adaptive  formant  tracking  using  the  glottal  closed 
phase  VRLS-VFF  algorithm  and  (2)  glottal  inverse  filtering  using  the 
glottal  closed  phase  WRLS-VFF  algorithm.  The  analyses  results  showed 
that  these  techniques  were  useful  when  the  speech  signal  had  high 
fundamental  frequency  (female  voice)  or  nasal  sounds. 

An  ARMA  cepstral  distance  measure,  based  on  the  WRLS-VFF 
algorithm  was  also  introduced.  The  coefficients  of  the  ARMA  cepstral 
distance  were  obtained  recursively  from  the  estimated  pole-zero 
coefficients  of  the  ARMA  process.  Simulation  results  showed  that  the 
ARMA  cepstral  distance  measure  gave  a better  representation  of  the 
spectrum  than  that  of  an  AR  cepstral  distance  measure  for 
analyzing  nasal  sounds. 

In  summary,  the  major  contributions  of  this  dissertation  were  the 
development  of  new  adaptive  parameter  estimation  algorithm  for  the 
estimation  of  time-varying  signal  parameters  for  an  ARMA  process  under 
nonstationary  signal  conditions,  and  the  application  of  this  algorithm 
to  speech  signal  analysis. 
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CHAPTER  1 


INTRODUCTION 

Problems  of  Parameter  Estimation 

Many  techniques  developed  in  signal  processing,  spectral 
estimation,  control,  and  time  series  analysis  are  based  either  on  an 
autoregressive  (AR)  model  or  on  the  more  general  autoregressive-moving 
average  (ARMA)  model.  The  ARMA  model  can  be  expressed  as  follows: 

P q 

yk  = E ai(k)yk_i  + E bi(k)uk_i  (1.1) 

i=l  i=o 

where  k is  the  time  index,  uk  is  a zero  mean  white  Gaussian  noise 
process  with  variance  au2,  yk  is  the  output  of  the  ARMA  process,  and 
ai(k),  bj(k)  are  the  time-varying  AR  and  MA  parameters,  respectively. 
Given  a sequence  of  observations  or  measurements  yk  assumed  to  satisfy 
(1.1),  it  is  desirable  to  estimate  the  parameters  a^)  and  b^k)  based 
on  the  available  data.  Most  past  work  on  this  ARMA  modeling  problem 
has  been  based  on  the  assumption  that  either  yk  is  a wide  sense 
stationary  process  and  hence  the  ARMA  parameters  do  not  depend  on  time, 
or  that  yk  is  a nonstationary  process  with  a^(k)  and  bj(k)  changing 
"slowly"  with  time. 

However,  in  many  practical  situations  the  statistical  properties 
of  the  underlying  signals  may  vary  rapidly  and  change  abruptly  at 
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unknown  times.  This  time  dependence  of  signal  characteristics  may  be 
the  result  of  a time-varying  medium  through  which  the  signal 
propagates,  or  it  may  be  due  to  the  nature  of  the  signal  itself. 
Nevertheless,  this  time-varying  information  is  essential  in  many 
applications.  For  example,  any  abnormalities  in  biomedical  signals 
such  as  evoked  potentials  (EP)  and  electroencephalogram  (EEG)  are 
attributed  to  pathological  changes.  Early  detection  of  such  changes  is 
crucial  in  a neurological  critical  care  unit  or  an  operating  room.  In 
speech  signal  analysis,  speech  parameters  such  as  vocal  tract 
resonances  (formants)  and  their  bandwidths  may  change  rapidly  for 

different  sounds  or  the  transition  from  one  sound  to  another. 

Accurately  tracking  these  parameters  is  necessary  for  the  development 
of  automatic  speech  and  speaker  recognition  systems  and  for  speech 
synthesis.  Other  examples  include  seismic  signals  which  may  reflect 

the  layered  media  changes  and  signals  which  are  used  for  failure 

detection  in  various  electromechanical  systems. 

Therefore,  the  problem  of  estimating  the  time-varying  signal 
parameters  can  be  stated  as  follows: 

(1)  How  to  estimate  accurately  the  signal  parameters,  e.g.,  a^ 
and  bj  in  the  ARMA  process,  under  a stationary  environment. 

(2)  How  to  track  the  signal  parameters  when  they  are  changing 
fast  with  time. 

(3)  How  to  detect  any  abrupt  changes  in  the  signal  parameters. 

Research  Issues 

Most  existing  methods  for  parameter  estimation  can  solve  only  one 
or  two  of  the  problems  stated  above.  Block  data  processing  techniques 
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such  as  the  autocorrelation  method  in  linear  predictive  coding  (LPC) 
[Makhoul,  1976],  maximum  entropy  [Burg,  1975]  and  modified  Yule-Walker 
equation  methods  [Cadzow,  1982],  to  name  just  a few,  can  estimate  the 
signal  parameters  accurately  only  when  the  underlying  signal  is 
stationary.  Due  to  the  use  of  windows  that  average  the  data  over 
analysis  frames  (blocks)  [Kalouptsidis  and  Theodoridis,  1987],  these 
techniques  can  neither  track  the  parameter  variations  nor  detect  the 
abrupt  changes  in  parameters  for  a nonstationary  process. 

Sequential  adaptive  algorithms  such  as  least  mean  squares  ( LMS) 
[Widrow  et  al.,  1975],  recursive  least  squares  (RLS)  [ Fr iedlander , 
1982a;  Cowan  and  Grant,  1985],  and  least  squares  lattice  ( LSL)  methods 
[Lee  et  al.,  1981],  have  been  developed  and  used  for  estimating  time- 
varying  signal  parameters  for  many  years.  These  methods  not  only 
provide  an  accurate  estimate  for  a stationary  process  but  also  track 
slowly  changing  model  parameters  [Haykin,  1985].  However,  it  has  been 
observed  [Hu  and  Abdallah,  1987]  that  these  existing  methods  still 
suffer  a number  of  deficiencies  which  combine  to  limit  overall  system 
performance  dramatically.  An  obvious  deficiency  is  the  lack  of  an 
appropriate  mechanism  to  choose  and  update  secondary  parameters.  Here, 
secondary  parameters  refer  to  those  conditions  which  characterize  the 
class  of  mathematical  models  whose  coefficients  are  to  be  adaptively 
updated.  Examples  of  secondary  parameters  include  model  type,  model 
order,  step  size,  weighting  (forgetting)  factor,  etc.  In  current 
adaptive  systems,  secondary  parameters  are  selected  in  advance  and  most 
of  them  remain  unchanged  throughout  the  adaptive  process.  While  this 
may  be  acceptable  for  time  invariant  systems,  it  is  prone  to  fail  for 
time-varying  parameter  tracking  problems.  The  reason  for  this  is  that 
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secondary  parameters  provide  a way  of  dynamically  controlling  the 
performance  of  the  algorithm.  Naturally,  it  is  desirable  to  make  the 
parameters  adapt  to  the  variation  of  data,  especially  under  a 
nonstationary  environment. 

Based  on  the  above  discussion,  a robust  method  for  estimating  and 
tracking  the  time-varying  signal  parameters  under  either  a stationary 
or  nonstationary  environment  can  be  designed  as  in  Figure  1.1.  The 
procedures  are 

(1)  Design  a sequential  (recursive)  adaptive  algorithm  for  the 
accurate  estimation  and  rapid  tracking  of  the  model  parameters. 

(2)  Select  and  update  secondary  parameters  continually  during  the 
adaptive  process.  This  can  be  done  by  first  assuming  some  initial 
parameter  estimates  and  then  letting  the  recursive  algorithm  operate  on 
the  incoming  data  to  estimate  either  the  prediction  error  or  the 
spectral  error.  If  the  error  is  small,  which  is  often  the  case  in  a 
stationary  environment,  the  current  parameters  should  be  retained. 
Otherwise,  the  secondary  parameters  should  be  updated  based  on  some 
error  information  criteria. 

Updating  all  the  secondary  parameters  simultaneously  is  very 
difficult.  This  is  due  to  the  absence  of  a mathematical  model  and  the 
fact  that  the  parameters'  space  cannot  be  explicitly  expressed.  Hu  and 
Abdallah  [1987]  suggested  that  some  artificial  intelligence  (AI) 
techniques  may  be  applicable  to  search  and  select  these  secondary 
parameters,  which  requires  the  generation  of  a knowledge  based  expert 
system.  Fortunately,  not  every  secondary  parameter  has  to  be  updated 
during  the  execution  of  the  adaptive  algorithm.  The  selection  of  the 
secondary  parameters  to  be  updated  depends  on  the  characteristics  of 
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Figure  1.1  Adaptive  parameter  estimation  method. 
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the  underlying  signals.  In  this  study,  we  solve  the  problem  of 
selecting  secondary  parameters  by  (1)  use  of  a general  ARMA  model  for 
parameter  estimation,  (2)  off-line  selection  of  the  model  order,  and 
(3)  on-line  updating  of  the  forgetting  factor.  The  reasons  for  this 
approach  are  discussed  below. 

Parameter  Estimation  Based  on  an  ARMA  Model 

In  general,  the  underlying  signal  can  be  modeled  either  as  an  AR 
process  or  as  an  ARMA  process.  This  is  discussed  in  detail  in  Chapter 
2.  Although  the  AR  or  all-pole  model  has  been  used  quite  successfully 
in  many  applications  such  as  EEG,  seismology,  sonar  and  speech  signal 
processing,  there  are  some  situations  where  the  inadequacies  of  the  AR 
modeling  technique  may  be  compensated  for  by  the  use  of  the  ARMA  model. 
For  example,  in  speech  signal  processing,  the  ARMA  or  pole-zero  model 
is  the  most  appropriate  model  for  estimating  the  spectrum  of  nasal  and 
consonant  sounds.  This  is  because  spectral  dips  (zeros)  can  arise  when 
the  nasal  tract  is  coupled  to  the  main  vocal  tract  through  the  velar 
opening  as  in  the  case  of  nasal  sounds  /m/,  /n/,  or  if  the  source  of 
excitation  is  not  at  the  glottis  but  is  in  the  interior  of  the  vocal 
tract  as  in  the  case  of  consonants  [Atal  and  Schroeder,  1978]. 
Furthermore,  it  is  well-known  that  an  AR  process  corrupted  by  noise  can 
be  modeled  by  an  ARMA  process  [Done  and  Rushforth,  1979].  Poles  of  the 
ARMA  model  are  identical  to  those  of  the  original  AR  model.  Thus,  the 
spectral  envelope  of  the  noisy  signal  can  be  estimated  accurately  by 
using  the  pole  predictor  of  the  pole-zero  model  obtained. 

According  to  the  above  point  of  view,  using  an  ARMA  model  is  more 
appropriate  in  the  development  of  the  adaptive  estimation  algorithm. 
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For  AR  parameter  estimation,  it  is  simple  to  set  the  MA  part  of  ARMA 
estimation  algorithm  equal  to  zero. 

Off-Line  Selection  of  the  Model  Order 

To  select  the  order  of  an  ARMA  model  at  each  step  during  the 
adaptive  process  is  difficult  and  time  consuming.  Fortunately,  the 
physical  model  of  the  signal  generation,  the  data  base  derived  from  the 
analysis  results  and  the  preprocessing  of  the  signal,  can  often  serve 
as  a good  reference  to  determine  the  model  order.  In  order  to  simplify 
the  development  of  the  adaptive  algorithm,  we  use  some  historical 
information  and  preprocessing  of  the  data  to  determine  the  model  order 
in  this  study. 

On-Line  Updating  the  Forgetting  Factor 

The  forgetting  factor  in  recursive  algorithms,  such  as  RLS  and 
LSL,  controls  the  effective  memory  length  used  in  the  adaptive  process. 
The  reasons  for  employing  the  forgetting  factor  in  these  algorithms  are 
as  follows:  (1)  For  a stationary  process,  the  recursive  algorithm  can 
provide  an  unbiased  estimate  by  choosing  a forgetting  factor  close  to 
unity,  meaning  that  the  algorithm  uses  all  of  the  previous  information 
from  the  signal  during  the  adaptive  process,  and  (2)  under  a 
nonstationary  environment,  fast  tracking  of  parameter  changes  and  fast 
convergence  of  the  adaptation  process  can  be  achieved  by  selecting  a 
small  forgetting  factor  [Cowan  and  Grant,  1985],  The  main  problem  in 
the  adaptive  process  is  that  we  do  not  have  a priori  information  about 
when  the  signal  process  is  stationary,  nonstationary  or  in  between. 
For  example,  speech  signals  may  contain  stationary  parts  as  in  the 
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vowel  sound  as  well  as  nonstationary  parts  as  in  the  vowel  to  consonant 
transition  region.  In  such  cases,  a robust  way  to  select  the 
forgetting  factor  is  to  let  the  adaptive  algorithm  self-adjust  or 
update  the  forgetting  factor  according  to  the  signal  conditions. 

Research  Objectives 

Based  on  the  above  discussion  and  assumptions,  the  main  objective 
of  this  study  is  to  develop  an  adaptive  algorithm  with  a variable 
forgetting  factor  for  ARMA  parameter  estimation.  This  algorithm  should 
produce  an  unbiased  estimate  for  stationary  signal  analysis  and  track 
fast  parameter  changes  for  nonstationary  signals.  A secondary 
objective  is  to  apply  this  algorithm  to  estimate  and  track  signal 
parameters  for  speech  signal  processing. 

Description  of  Chapters 

Chapter  2 presents  the  background  materials  relevant  to  this 
thesis  and  reviews  the  existing  AR  and  ARMA  parameter  estimation 
algorithms  for  both  block  data  processing  and  adaptive  estimation.  The 
weighted  recursive  least  squares  algorithm  with  a variable  forgetting 
factor  (VRLS-VFF)  is  developed  in  Chapter  3.  The  criteria  to  select 
the  model  order  and  the  stability  analysis  of  the  WRLS-VFF  algorithm 
are  also  discussed  at  the  end  of  this  chapter.  Chapter  4 has  two 
parts.  In  part  1,  we  describe  the  implementation  procedures.  In  part 
2,  we  generate  several  synthetic  signals  based  on  both  AR  and  ARMA 
models  and  use  these  signals  to  evaluate  the  performance  of  the  VRLS- 
VFF  algorithm.  A comparison  of  results  between  the  VRLS-VFF  and  other 
algorithms  such  as  weighted  Kalman,  sequential  ARMA,  general  VRLS  and 
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weighted  LSL  is  given.  Chapter  5 contains  three  applications  using  the 
VRLS-VFF  algorithm  for  speech  signal  analysis.  In  the  first 
application,  the  WRLS-VFF  algorithm  is  applied  to  track  the  formants 
and  bandwidths  of  the  speech  signals.  The  closed  phased  approach 
combined  with  this  algorithm  is  used  to  analyze  different  types  of 
speech  signals  such  as  vowels,  nasals,  and  consonants.  Both  isolated 
words  and  sentences  are  tested.  In  the  second  application,  a glottal 
inverse  filtering  method  based  on  the  VRLS-VFF  algorithm  is  developed. 
Finally,  a cepstral  distance  measure  obtained  from  the  VRLS-VFF 
algorithm  is  introduced  and  compared  with  general  AR  cepstral  distance 
measure.  Chapter  6 concludes  the  study  with  a discussion,  summary  of 
results  and  recommendations  for  future  research. 


CHAPTER  2 


ISSUES  ON  PARAMETER  ESTIMATION 
Introduction 

A sequence  is  an  ARMA  sequence  if  it  is  generated  according  to 
the  relationship,  as  stated  in  (1.1) 

P q 

1 ai  yk-i  = 1 b}  uk_if  a0=l  (2.1) 

i=o  i=o 

where  uk  is  a zero  mean  unit-variance  white  noise  sequence.  This  set 
of  linear  equations  describes  the  input-output  relationship  of  a linear 
system,  as  shown  in  Figure  2.1.  Note  that  the  ARMA  model  reduces  to  an 
AR  model  for  q=0,  as  shown  in  Figure  2.2.  The  transfer  function  H(z) 
of  the  ARMA  process  is  expressed  as 

q P 

H(z)  = B(z)/A(z)  = I bj  z 1 / E aj  z~1  (2.2) 

i=o  i=o 

where  A(z)  and  B(z)  represent  the  transfer  functions  of  the  AR  and  the 
MA  parts,  respectively.  Multiplying  yk_j  on  the  both  sides  of  Equation 
(2.1)  and  taking  the  expected  value  gives 

P q 

E ai  r-j^  = E bi  Pj _i , a0=l  (2.3) 

i=o  i=o 

where  rj_j  = E[yjc_j  yjc_j]  and  pj_^  = E[yjc_j  u^jj.  This  system,  known 
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11 


Figure  2.2  AR  model  with  order  p. 
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as  the  Yule-Walker  equations,  involves  the  autocorrelation  sequence  rj 
of  the  system's  output  as  well  as  the  cross-correlation  sequence  pj 
between  the  system's  input  and  output. 

For  an  AR  model  with  order  p,  pj=0  holds  for  all  j,  then  Equation 
(2.3)  reduces  to 


P 

Z 

i=o 


J-i 


= 1,  j = 1 » 2 , 


(2.4) 


The  AR  parameter  estimation  problem  then  relies  on  how  to  estimate  rj 
from  the  available  data  yk  and  solve  aj  from  Equation  (2.4)  using  some 
fast  and  efficient  methods,  which  are  discussed  in  the  next  section. 

The  ARMA  estimation  problem  entails  solving  the  Yule-Walker 
equations  for  the  parameter  sets  [aj,  a2,...,ap]  and  [bj,  b2,...,bq]. 
If  the  system  is  causal,  y^  is  uncorrelated  with  ujc+i,  for  i>0,  then  pj 
= 0 for  j>q.  Equation  (2.3)  can  be  written  as 

P 

Z ai  rj-i  = °*  ao=1»  j=q+l,  q+2 , . . . (2.5) 
1=0 


Equation  (2.5)  is  called  the  extended  (modified)  Yule-Walker 
equation  (MYWE)  [Cadzow,  1982]  and  it  invloves  a linear  combination  of 
the  parameters  aj  alone.  This  suggests  that  the  AR  and  MA  parameters 
can  be  solved  for  separately.  We  will  review  several  methods  based  on 
Equation  (2.5)  in  the  following  sections. 


Parameter  Estimation  Algorithms  Based  On  Block  Data  Processing 
The  following  sections  review  the  parameter  estimation 
algorithms,  based  on  a block  of  data,  for  the  AR,  MA  and  ARMA 
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processes,  respectively.  In  other  words,  all  the  data  are  available 
before  executing  the  estimation  process. 

AR  Parameter  Estimation 

A direct  approach  to  parameter  (order  p)  estimation  is  first  to 
estimate  the  p autocorrelation  lags  from  the  block  data  and  then  solve 
the  Yule-Walker  equations  using  the  Levinson  recursion  algorithm 
[Levinson,  1947].  This  algorithm  requires  a number  of  computational 
operations  proportional  to  p2.  The  most  popular  approach  for  AR 
parameter  estimation  that  guarantees  a stable  filter  is  Burg's 
algorithm  [1975],  also  known  as  the  maximum  entropy  method  (MEM).  This 
technique  chooses  the  best  set  of  AR  parameters  from  the  given  data 
samples  by  minimizing  the  sum  of  the  average  energies  of  the  forward 
and  backward  linear  prediction  errors,  subject  to  the  constraint  that 
the  optimum  set  of  parameters  satisfies  the  Levinson  recursion.  While 
Burg's  algorithm  provides  increased  spectral  resolution  over 
conventional  methods  [Welch,  1967]  and  the  Yule-Walker  technique  (based 
on  biased  autocorrelation  estimates),  the  AR  spectrum  estimated  on 
Burg  s algorithm  suffers  mainly  from  two  disadvantages:  the  spectral 

line  splitting  under  high  model  orders  and  moderate  signal-to-noise 
ratio  [Fougere,  1977;  Chen  and  Stegen,  1974]  and  the  bias  in  the 
positioning  of  spectral  peaks  [Helme  and  Nikias,  1985]. 

Alternative  methods  which  compute  the  AR  parameters  by  minimizing 
the  sum  of  forward  and  backward  error  energies  of  the  pth  order 
predictor  with  respect  to  all  the  reflection  coefficients  up  to  this 
order  have  also  been  devised  [Goutis  and  Ibrahim,  1983;  Fougere,  1977]. 
Although  these  methods  alleviate  the  problems  of  spitting  and  bias, 
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they  result  in  nonlinear  systems  of  equations  requiring  time  consuming 
iterative  procedures  for  their  solution. 

Ulrych  and  Clayton  [1976]  and  Nuttal  [1976]  independently 
suggested  a method  that  estimates  the  AR  parameters  by  direct 
minimization  of  the  sum  of  forward  and  backward  error  energies  with 
respect  to  the  unknown  parameters.  This  method,  which  we  shall  call 
the  forward-backward  least  squares  (FBLS)  method,  alleviates  the 
problems  of  splitting  and  positioning  bias,  and  at  the  same  time  is 
faster  compared  to  the  above  iterative  techniques.  The  main 
disadvantage  of  this  method  as  compared  to  Burg's  algorithm  is  its 
computational  complexity  requiring  0(p3)  multiplications  and  additions, 
while  Burg's  algorithm  requires  only  0(p2),  because  it  takes  advantage 
of  the  Toeplitz  property  of  the  corresponding  prediction  equations. 
The  computational  complexity  of  the  FBLS  method  was  later  reduced  by 
Marple  [1980],  who  pointed  out  that  the  associated  matrix  for  the 
linear  system  of  equations  also  has  a special  structure  and  is  the  sum 
of  a near  to  Teoplitz  and  a near  to  Hankel  matrix.  Marple  proposed  an 
algorithm  with  a computational  complexity  comparable  to  that  of  Burg's, 

, n 

i.e.,  0(p  ).  The  combination  of  its  good  performance  and  relatively 
low  complexity  has  made  this  algorithm  one  of  the  most  successful  for 
AR  parameter  estimation  [Kay  and  Marple,  1981]. 

ARMA  Parameter  Estimation 

The  particular  solution  procedures  applied  to  the  Yule-Walker 
equation  (2.3)  serves  to  dichotomize  ARMA  algorithms  into  a class  that 
estimates  the  AR  and  MA  parameters  simultaneously  and  a class  that 
estimates  the  AR  and  MA  parameters  separately.  The  first  class 
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includes  the  prefiltering  method  [Steiglitz,  1977;  Konvalinka  and 
Matausek,  1979]  and  maximum  likelihood  techniques  [Kay,  1987].  In  this 
class  of  solutions,  however,  the  resulting  equations  are  nonlinear  and 
require  iterative  optimization  techniques.  Thus,  although  this  class 
of  algorithms  provides  good  spectral  envelope  estimation,  their 
computational  complexity  and  convergence  issues  make  their  applications 
in  real  time  signal  processing  difficult. 

In  the  second  class,  suboptimal  techniques  have  been  developed  to 
significantly  reduce  the  computational  complexity.  The  techniques  are 
usually  based  on  a least  squares  criterion  and  require  solution  of 
linear  equations.  The  AR  parameters  are  typically  estimated  first, 
independently  of  the  MA  parameters,  by  some  variant  of  the  modified 
Yule-Walker  equation.  The  MA  parameters  are  subsequently  estimated  in 
one  of  several  ways,  as  will  be  discussed  on  the  next  section. 

The  basic  idea  behind  the  MYW  method  is  quite  simple  and  has  been 
presented  in  various  references  [Gersh,  1970;  Box  and  Jenkins,  1970]. 
However,  straightforward  application  of  the  method  as  proposed  in  the 
above  references  can  lead  to  poor  performance  especially  for  short  and 
noisy  data  records.  An  alternative  approach  examined  by  Cadzow  [1982] 
and  Porat  and  Friedlander  [1985]  is  to  use  more  than  p (AR  order) 
equations  for  lags  greater  than  q (MA  order)  and  minimize  a squared 
error  to  fit  p AR  parameters.  Such  a procedure  then  becomes  a least 
squares  (also  called  over-determined)  modified  Yule-Walker  technique. 
These  techniques  significantly  improve  the  quality  of  the  estimates  as 
well  as  reduce  the  computational  complexity  that  make  it  practically 


useful . 
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In  speech  signal  analysis,  the  input  to  the  vocal  tract  filter 
can  be  either  a periodic  pulse  train  (simulating  a fundamental 
frequency  of  voicing)  or  white  noise  (simulating  turbulence  noise  as 
for  fricatives).  In  order  to  reduce  the  effect  of  periodicity,  a 
homomorphic  (cepstral)  processing  approach  for  ARMA  parameter 
estimation  has  been  proposed  by  many  researchers.  Childers  et  al. 
[1981]  suggested  that  the  impulse  response  of  the  vocal  tract  (ARMA 
model)  can  be  estimated  first  from  the  speech  signal  using  the  cepstral 
processing.  Then  the  impulse  response  is  used  to  compute  the  cross- 
correlation in  a set  of  equations  that  can  be  solved  for  the  ARMA 

parameters  simultaneously.  Yegnanarayana  [1981]  proposed  a cepstral 
matching  method  that  decomposes  the  poles  and  zeros  of  the  short-time 
spectrum  of  the  speech  signal  in  the  cepstral  domain.  According  to 
Yegnanarayana' s paper,  the  group  delay  of  a minimum  phase  signal  can  be 
evaluated  by  a set  of  cepstral  coefficients.  The  positive  and  negative 
peaks  in  the  negative  derivative  of  phase  spectrum  (group  delay) 
correspond  to  complex  poles  and  complex  zeros  of  the  signal, 
respectively.  Separation  of  the  positive  and  negative  portions  of  the 
group  delay  enables  the  computation  of  cepstral  coefficients 
corresponding  to  poles  and  zeros  and  thus  decomposing  the  signal  into 
its  pole  part  and  zero  part.  The  ARMA  model  coefficients  correspond  to 
the  poles  and  zeros  then  can  be  obtained  by  the  cepstral  to  AR 
coefficients  transformation  [Markel  and  Gray,  1976]. 

Although  the  cepstral  processing  techniques  avoid  the  problems 
of  pitch-synchronization  and  provide  a good  estimate  of  the  pole-zero 
spectrum,  their  complexity  is  inevitable  because  for  each  set  of  ARMA 
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parameters,  four  FFT  (Fast  Fourier  Transform)  operations  are  required 
to  get  the  cepstral  coefficients,  impulse  response  and  group  delay. 

MA  Parameter  Estimation 

The  most  obvious  approach  to  estimate  the  MA  parameters  is  to 
solve  the  nonlinear  equations  using  the  autocorrelation  estimates 
obtained  from  the  data.  This  method  involves  difficult  spectral 
factorization  techniques  [Friedlander  and  Porat,  1984].  An  alternative 
approach  is  to  estimate  the  MA  parameters  from  the  residual  sequence 
obtained  from  an  A R inverse  filter  assuming  that  there  remains  only  MA 
information  in  the  sequence.  This  MA  process  can  be  approximated  by  an 
AR  process  of  finite  (high)  order.  The  MA  parameters  are  then  obtained 
from  parameter  conversion  by  either  polynomial  division  [Cadzow,  1982] 
or  spectral  inversion  [Durbin,  1959;  Kopec  and  Oppenheim,  1977; 
Yanagida  and  Kakusho,  1986]. 

Recursive  (Sequential)  Techniques  for  Parameter  Estimation 

Recursive  techniques  have  many  advantages  in  situations  where 
large  amounts  of  data  need  to  be  processed  in  real  time  or  "nearly  real 
time."  The  advantages  include  the  following:  1)  Obtaining  parameter 

estimates  quickly.  A recursive  algorithm  provides  estimates 
sequentially  with  time;  it  does  not  have  to  wait  until  all  the  data  are 
collected.  2)  Providing  a computationally  efficient  method  for 
tracking  time-varying  parameters.  Batch  techniques  require  a fixed 
amount  of  computation  for  each  update  of  the  parameter  estimates.  The 
more  often  these  updates  are  performed,  the  more  computation  is 
required.  Recursive  techniques  automatically  provide  a point-by-point 
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update.  3)  Recursive  techniques  require  relative  little  memory/storage 
since  the  data  are  processed  one  point  at  a time.  4)  Recursive 
techniques  are  also  better  suited  for  parallel  processing 
implementations.  This  makes  them  attractive  candidates  for 
microprocessor  and  VLSI  based  systems  [Friedlander,  1982a]. 

In  the  following  sections,  we  will  discuss  the  basics  of  adaptive 
filtering  theory  used  for  adaptive  parameter  estimation,  the  recursive 
algorithm  used  in  the  adaptive  estimation,  the  existing  recursive 
algorithms,  the  constraints  on  the  recursive  algorithm  for  the  specific 
application,  and  ways  to  improve  the  recursive  algorithm  for  those 
applications . 

Parameter  Estimation  Based  on  Adaptive  Filtering  Theory 
Basics  of  Adaptive  Filtering  Theory 

The  basic  structure  of  an  adaptive  filter  is  shown  in  Figure  2.3. 
The  adaptive  filter  (adaptive  algorithm  + programmable  filter)  operates 
by  filtering  the  input  signal  sk  with  a programmable  filter  to  yield  an 
output  yk.  This  yk  is  compared  against  a desired  condition  or  training 
signal,  dk,  to  yield  an  error  signal,  ek.  This  error  is  then  used  by 
the  adaptive  algorithm  to  adjust  the  weights  in  the  programmable 
filter  to  minimize  the  difference  (or  the  error)  between  the  filter 
output  yk  and  the  desired  input  dk.  Based  on  the  choice  of  the  desired 
signal,  the  adaptive  filter  can  be  used  for  different  applications  such 
as  adaptive  noise  cancelling,  echo  cancelling,  adaptive  beamforming, 
parameter  modeling,  and  system  identification  [Haykin,  1985;  Cowan  and 
Grant,  1985].  Two  common  types  of  programmable  filters  can  be  used  in 
the  design  of  the  adaptive  filter,  namely,  finite  impulse  response 
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desired  output 


Figure  2.3  Block  diagram  of  the  adaptive  filter. 


measured  signal 


Figure  2.4  Block  diagram  of  the  adaptive  linear  predictor. 
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(FIR)  and  infinite  impulse  response  (HR)  filters.  The  choice  of  the 
adaptive  algorithm  is  determined  by  various  factors  such  as  rate  of 
convergence,  misadjustment , robustness,  and  computational  requirement. 
In  the  remainder  of  this  study  we  will  be  primarily  concerned  with 
using  the  adaptive  filter  for  parameter  estimation.  We  discuss  two 
adaptive  structures  for  parameter  estimation  in  the  following  section, 
namely,  the  adaptive  linear  predictor  and  the  adaptive  system 
identification. 

Adaptive  Linear  Predictor 

The  adaptive  linear  predictor  shown  in  Figure  2.4  can  be 
expressed  as 


P 

ek  = yk  - £ wi (k)y(k-i ) (2.6) 

i = l 

where  w4  is  the  coefficient  of  the  adaptive  filter.  Note  the 
similarity  between  Figure  2.2  and  Figure  2.4.  If  we  assume  that  the 
error  signal  in  (2.6)  is  equivalent  to  the  input  white  noise  (uk)  of  an 
AR  process,  then  the  adaptive  linear  predictor  can  provide  the  same 
spectral  representation  as  an  AR  model  [Alexander,  1986].  At 
convergence,  the  weighted  coefficients  (wj)  of  the  adaptive  linear 
predictor  will  approach  to  the  AR  coefficient  (a^). 

Adaptive  Modeling  and  System  Identifica t i on 

An  adaptive  filter  can  be  used  in  parametric  modeling  and  system 
identification,  that  is,  imitating  the  behavior  of  a physical  dynamic 
system  which  may  be  regarded  as  an  unknown  black  box. 


For  example,  in 
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Figure  2.5  Block  diagram  of  the  adaptive  system  identification. 
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speech  signal  processing,  the  black  box  can  be  considered  as  the  vocal 
tract  system.  Modeling  of  a single-input,  single-output  dynamic  system 
is  illustrated  in  Figure  2.5.  Both  the  unknown  system  and  the  adaptive 
filter  are  driven  by  the  same  input.  The  adaptive  filter  adjusts 
itself  with  the  goal  of  causing  its  output  to  match  that  of  the  unknown 
system,  generally  causing  its  output  to  be  the  best  least  squares  fit 
to  that  of  the  unknown  system.  The  unknown  system  can  be  configured  as 
an  all-pole,  an  all-zero  or  a pole-zero  model  depending  upon  the 
characteristics  of  the  signal  generation  process  of  the  system. 

Review  of  Adaptive  Parameter  Estimation  Algorithms 
Adaptive  AR  Parameter  Estimation 

The  problem  of  recursively  estimating  of  AR  parameters  has  been 
considered  by  a number  of  authors  [Ljung,  1981;  Sharman  and 
Friedlander , 1984;  Kalouptsidis  and  Theodoridis,  1984].  The  two  major 
approaches  for  updating  the  parameters  of  an  adaptive  AR  model  (FIR 
filter)  are  the  gradient  search  approach  [Vidrow  et  al.,  1975]  and  the 
recursive  least  squares  method  [Lee,  1974;  Cowan  and  Grant,  1985]. 

The  least  mean  squares  (LMS)  algorithm,  which  is  based  on  a 
gradient  search  technique  for  minimizing  a quadratic  performance 
function,  has  been  used  for  many  signal  processing  applications.  Two 
general  implementation  structures  using  the  LMS  algorithm  are  the 
tapped  delay  line  filter  and  adaptive  lattice  filter. 

The  main  advantages  of  the  LMS  algorithm  with  the  tapped  delay 
line  structure  are  its  computational  simplicity  and  ease  of 
implementation.  These  advantages  make  the  LMS  algorithm  an  attractive 
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solution  for  many  practical  problems.  However,  there  are  some 
drawbacks  such  as  the  need  for  a priori  knowledge  of  incoming  signal 
statistics  to  choose  the  proper  convergence  factor.  This  a priori 
information  may  not  always  be  available  [Yassa,  1987].  Meanwhile,  the 
major  problem  of  the  LMS  for  a transversal  filter  is  that  the  rate  of 
convergence  is  sensitive  to  variations  in  the  eigenvalue  spread, 
defined  as  the  ratio  of  the  maximum  to  minimum  eigenvalues  of  the 
signal  correlation  matrix  [Haykin,  1985].  Typically,  a wide  spread  in 
the  eigenvalues  results  in  a slow  convergence  of  the  LMS  algorithm. 

Griffiths  [1977]  extended  the  LMS  algorithm  to  finite  impulse 
response  (FIR)  lattice  structures.  The  algorithm  of  Griffiths  consists 
of  updating  two  sets  of  coefficients,  namely,  the  reflection  factors 
and  tap  weights.  The  major  advantage  of  the  lattice  structure  lies  in 
the  improved  convergence  rate  over  the  conventional  tapped  delay  line 
because  of  the  inherent  orthogonalization  of  the  prediction  error 
sequence  at  each  stage.  In  this  algorithm,  it  is  stated  that  the 
lattice  performs  better  than  the  tapped  delay  line  for  input  signals 
with  widespread  eigenvalues.  Additionally,  the  numerical  performance 
of  the  lattice  structures  has  been  shown  to  be  more  robust  than  that  of 
the  tapped  delay  line. 

The  convergence  properties  of  the  two  LMS  algorithms  discussed 
above  are  highly  dependent  on  the  value  of  the  convergence  factor  (step 
size  "u"),  which  unfortunately,  also  determines  the  "noise"  in 
adaptation  due  to  the  stochastic  nature  of  the  algorithm.  A large 
value  of  "u"  will  ensure  faster  convergence,  but  will  also  introduce  a 
large  amount  of  "misadjustment"  in  the  adaptive  process.  One  way  to 
achieve  optimal  convergence  properties  for  adaptive  AR  parameter 
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estimation  is  to  use  the  recursive  (sequential)  least  squares  (RLS) 
algorithm,  however  at  the  cost  of  increased  computational  complexity. 

The  conventional  implementation  of  a RLS  algorithm  for  AR 
parameter  estimation  requires  0(p2)  MADPR  (multiplications  and 
divisions  per  resursion)  [Carayannis  et  al.,  1983],  where  p is  the 
number  of  estimated  parameters.  This  complexity  makes  their  real  time 
implementation  a fairly  difficult  task  even  with  recent  hardware.  In 
contrast,  a variety  of  fast  RLS  algorithms  offers  a computational 
complexity  proportional  to  the  number  of  estimated  parameters.  For 
example,  the  fast  Kalman  algorithm  [Falconer  and  Ljung,  1978]  requires 
0(8p)  MADPR,  the  fast  a posteriori  error  sequential  technique  requires 
0(5p)  MADPR,  and  the  exact  least  squares  lattice  [Lee  et  al.,  1981] 
requires  0(15p),  all  for  AR  parameter  estimation. 

Adaptive  ARMA  parameter  estimation 

Two  approaches  are  used  for  recursive  ARMA  parameter  estimation, 
namely,  a recursive  maximum  likelihood  method  (RML)  [Astrom,  1980; 
Friedlander,  1982]  and  a model  reference  adaptive  system  identification 
method  (MRAD)  [Landau, 1976 ; Miyanaga  et  al.,  1982]. 

The  RML  algorithm  is  based  on  the  approximate  maximization  of  the 
likelihood  function  of  the  observed  data  sequence,  Yn  = (yn-l»  yn_2> 
yn-pi>  parameterized  by  an  ARMA  coefficient  vector.  The  key  part 
of  the  RML  algorithm  is  that  the  filtered  data  vector  used  in  the 
recursive  equations  is  obtained  by  passing  the  measured  and  estimated 
data  vectors  through  a prefilter  1/B(kz"1),  where  B(z)  is  the  z- 
transform  of  the  MA  process.  The  prefilter  is  obtained  from  the  MA 
part  of  the  estimated  parameter  vector  of  the  ARMA  process.  For  the 
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standard  RML  algorithm  k is  equal  to  1 [Soderstrom  et  al.,  1978].  But 
the  disadvantage  associated  with  it  is  that  continuous  monitoring  of 
the  stability  of  the  "prefilter"  is  needed.  Moreover,  for  a narrow 
band  signal  (like  speech),  where  roots  are  close  to  the  unit  circle, 
the  system  response  becomes  very  slow;  hence  the  algorithm  takes  a long 
time  to  converge.  Friedlander  [1982]  suggested  a "modified  prefilter" 
with  value  of  k < 1.  This  has  the  effect  of  pulling  the  roots  towards 
the  center  of  the  unit  circle,  which  reduces  the  time  constant  of  the 
system  and  speeds  convergence.  If  the  prefilter  parameter  "k"  is  set 
to  zero  (i.e.,  no  prefiltering  is  done)  and  the  driving  source  u^  is 
known,  then  the  RML  algorithm  has  the  same  form  as  the  recursive  least 
squares  (RLS)  solution  to  the  problem  of  estimating  the  ARMA  parameters 
[Moore  and  Weiss,  1979].  Morikawa  and  Fujisaka  [1982]  proposed  a 
sequential  ARMA  estimation  (SEARMA)  algorithm  based  on  the  RLS 
recursions. 

Both  the  RML  and  SEARMA  algorithm  use  the  residual  error  to 
estimate  the  driving  source  which  is  assumed  to  be  white.  However,  in 
some  applications  the  input  excitation  is  not  guaranteed  to  be  always 
white,  e.g.,  the  input  can  be  a transient  pulse  as  in  the  EEG  signal  or 
can  be  periodic  pulse  trains  as  in  the  phonation  of  vowel  sounds  in 
the  speech  signal.  The  nonwhite  estimated  input  will  decrease  the 
accuracy  of  the  estimated  parameters  if  the  algorithms  do  not  remove 
the  effect  of  these  input.  Miyanaga  et  al.  [1982]  suggested  using  two 
adaptive  algorithms  to  estimate  the  input  as  well  as  the  ARMA 
parameters  simultaneously.  This  will  increase  the  system  complexity. 
Later,  Miyanaga  et  al  [1986]  proposed  a model  identification  system 
[MIS]  method  which  is  an  extended  form  of  the  Kalman  filtering 
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algorithm.  They  use  a "momentary  likelihood"  function  of  the  variance 
of  the  input  white  noise  and  variance  of  the  parameter  variations  to 
evaluate  the  nonstationarity  of  the  underlying  system.  Both  the  input 
pulse  and  the  parameter  variations  can  be  estimated  by  comparing  the 
power  of  the  prediction  error  to  the  variance  of  the  estimation  error 
since  the  prediction  error  becomes  unexpectedly  large  during  the 
nonstationary  process.  If  the  prediction  error  is  small,  the  input 
excitation  is  assumed  to  be  white  noise  and  the  parameter  variation  is 
neglected.  The  computational  cost  of  this  method  is  less  than  that 
proposed  by  [Miyanaga  et  al.,  1982]. 

The  forgetting  factor  used  in  the  above  algorithms  is  either  set 
to  a constant  value  as  in  the  SEARMA  method  or  varied  according  to 
certain  updating  equations  as  in  the  RML  and  MIS  methods.  Fortescue 
et  al.  [1981]  suggested  that  it  was  better  to  update  the  forgetting 
factor  based  on  the  estimation  error  and  filter  gain,  since  these  two 
factors  can  indicate  the  nonstationarity  of  the  underlying  process. 
The  nonstationary  process  is  generated  either  due  to  system  parameter 
changes  or  due  to  the  irregular  input  excitation  to  the  system. 
Therefore,  a variable  forgetting  factor  based  on  the  estimation  error 
and  filter  gain  may  be  used  to  track  the  parameter  variations  of  a 
nonstationary  process  and  to  estimate  the  input  excitations.  In  the 

remaining  chapters  we  will  derive  a recursive  formula  to  update  the 
variable  forgetting  factor  in  a RLS  based  ARMA  parameter  algorithm  and 
we  will  use  this  formula  to  fast  track  the  time-varying  parameters  of 
both  a synthetic  signal  and  real  speech. 


CHAPTER  3 


THE  DEVELOPMENT  OF  THE  WEIGHTED  RECURSIVE  LEAST  SQUARES 
ALGORITHM  WITH  A VARIABLE  FORGETTING  FACTOR  (WRLS-VFF) 

Introduction 

The  successful  application  of  system  identification  techniques  in 
adaptive  control  motivated  us  to  apply  the  same  techniques  to  signal 
processing.  Of  particular  interest  is  the  fact  that  some  commonly  used 
parameter  estimation  algorithms  such  as  recursive  maximum  likelihood, 
recursive  least  squares  and  least  squares  lattice,  are  capable  of 
estimating  ARMA  parameters,  and  not  just  AR  parameters. 

According  to  Landau  [1976],  recursive  identification  of  the 
dynamic  parameters  of  a process,  as  well  as  tracking  the  process 
parameters  when  they  are  time-varying,  can  be  formulated  as  a model 
reference  adaptive  problem  (see  Figure  2.5).  The  process  (unknown 
system)  to  be  identified  represents  the  reference  model.  The 
adjustable  system  is  constituted  by  an  adjustable  model  (also  called  an 
estimation  model)  having  the  same  structure  as  the  mathematical  model 
of  the  process  whose  parameters  are  controlled  by  an  adaptive  mechanism 
(which  implements  an  identification  algorithm). 

The  application  of  the  adaptive  system  identification  algorithm 
to  ARMA  parameter  estimation  needs  some  assumptions: 

1.  Assume  the  unknown  system  to  be  an  ARMA  process  driven  by  an 


27 


28 


inaccessible  white  noise  process.  This  also  leads  to  an  adaptive 
infinite  impulse  response  (HR)  filter. 

2.  Since  the  input  to  the  estimation  model  is  not  available  at 
each  step,  an  input  estimation  algorithm  is  required  to  estimate  the 
input  before  executing  the  adaptive  process. 

3.  The  variable  forgetting  factor  (VFF)  proposed  in  this  study 
and  the  prediction  error  of  the  adaptive  process  can  be  used  to 
estimate  the  input  excitation  of  the  unknown  system. 

Using  the  weighted  recursive  least  squares  (WRLS)  algorithm  for 
ARMA  parameter  estimation  is  proposed  for  this  study.  Its  block 
diagram  is  shown  in  Figure  3.1. 

Derivation  of  the  WRLS-VFF  Algorithm  for  ARMA  Parameter  Estimation 
Weighted  Recursive  Least  Squares  (WRLS)  Algorithm 

Suppose  the  unknown  system  given  in  Figure  3.1  can  be  modeled  as 
an  ARMA  process,  then  the  output  sequence  yk  is  generated  according  to 
the  following  equation: 

P q 

yk  = - 2^  ai(k)  yk_i  + ^ bj(k)  uk_j  + uk  (3.1) 

where  uk  is  a zero  mean  white  noise  input  sequence,  and  aA  and  bi  are 
the  AR  and  MA  parameters,  respectively.  This  set  of  linear  equations 
describes  the  input-output  relationship  of  a linear  system  with  p poles 
and  q zeros.  The  selection  of  the  model  order  (p,q)  will  be  discussed 
later  in  this  chapter.  Predetermined  values  of  p and  q are  used  in  the 
derivation  of  the  WRLS  algorithm. 

As  mentioned  above,  since  the  measured  signal  yk  depends  on  the 
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Figure  3.1  Block  diagram  of  the  WRLS  algorithm  for  ARMA 
parameter  estimation 
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input  signal  uk,  it  is  necessary  to  estimate  uk  so  that  the  ARMA 
parameters  of  (3.1)  can  be  estimated  accurately  from  yk.  An  estimation 
algorithm  for  uk  based  on  the  variable  forgetting  factor  will  be  given 
in  the  next  sections.  In  this  section,  we  assume  that  an  estimated 
input  uj<._^  (i=0, 1,2,  . . . ,k)  is  already  obtained  at  instant  k. 

Let  us  define  a parameter  vector  an  estimated  parameter 

vector  0^  and  a data  vector  by  the  following  equations: 

0kt  = [a1(k),...,ap(k),b1(k> bq(k) ] (3.2) 

V = [a1(k),...,ap(k),B1(k),...,Bq(k)J  (3.3) 

<^t  = I_yk-1» • • • ’-yk-p>Qk-l »Qk-q]  (3.4) 

where  a superscript  t denotes  transpose,  and  a*  and  bt  are  the 
estimated  ARMA  parameters,  respectively.  Using  (3.2),  (3.3)  and  (3.4), 
the  measured  signal  yk  and  the  estimated  signal  yk  are  expressed  in 
matrix  form  as 


yk  = 4>kt  3c  + uk 

Vk  = V % + % (3.5) 


rk  the  residual  error  of  the  ARMA  process,  namely, 


rk  = Vk  - %. 


(3.6) 


then  a weighted  least  squares  criterion  (cost  function)  can  be  defined 
as  the  weighted  sum  of  squares  of  the  residual  errors,  namely, 
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k 

Vk(0)  = Z X k~1  ri2  (3.7) 

i = l 
or 

k 

Vk(9)  = Z Xk-i  (yi  - 0.)2  (3.8) 

i=l 

where  X is  the  weighting  (forgetting)  factor,  which  is  assumed  to  be  a 
constant  during  the  process  of  adaptation. 

The  problem  now  is  to  obtain  an  estimate  for  0k  that  minimizes 
the  cost  function  Vk(0).  To  see  the  problem  more  clearly,  the 
summation  of  squares  of  the  residual  errors  in  (3.8)  is  rewritten  in 
matrix  form 


_ 

'yf 

' -y0  • • 
-yi 

0o  0 . 

. 0 

/ 

- rk' 

< y^ 

• -yk-i  • 

■-yk-p  ^k-l  • 

• Gk-q^ 

ai(k)' 

ap(k) 

6i(k) 

6q(k), 


Rk  = Yk  “ $kt  hi 


(3.9) 


In  forming  this  equation  we  assumed  that  yk  = 0 for  k < 0.  In  other 
words,  the  data  are  "prewindowed,"  that  is,  multiplied  by  a window 
function  wk,  where  wk  = 0 for  k<0  and  wk  =1  for  k > 1. 

Note  that  minimizing  the  cost  function  Vk(0)  is  equivalent  to 
minimizing  the  weighted  squared  norm  of  (3.8),  since 


k 

Vk( 0)  = Z Xk-i  (Yi  - e.)2 

i=l 

= Rkz  \ Rk 

= II  \1/2Rk  II2 


(3.10) 
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where  \ = diag  [ X^- , X^1  X,l] . The  vector  that  minimizes  this  norm 

is  given  by 

= [*k  \ Vl-1  *k  \ Yk  (3.11) 

This  equation  follows  from  a standard  result  in  linear  algebra 
regarding  the  solution  of  over-determined  sets  of  linear  equations 
[Cowan  and  Grant,  1985].  Equation  3.11  provides,  in  principle,  a 
solution  to  the  adaptive  system  identification  problem.  The  derivation 
of  the  recursive  least  squares  algorithm  for  estimating  the  ARMA 
parameters  is  shown  in  Appendix  A;  here,  we  summarize  these  equations: 

Prediction  error:  = yk  - 4’kt  ®k-l 

Gain  update:  Kk  = Plc_1  ^ [X  + ^ P^  <^]-l 

Parameter  update:  \ = 0k_1  + Kk 

Covariance  matrix:  Pk  = X-1  [Pk_i  + Kk  Pk-1] 

Input  estimate:  uk  = yk  - 0^  (3.12) 

Derivation  of  the  Variable  Forgetting  Factor 

During  the  derivation  of  the  WRLS  algorithm  in  the  previous 
section,  the  variable  forgetting  factor  (X)  was  assumed  to  be  a 
constant,  namely, 

CK  X <1,  X = Xj^  = Xj^_^ 

As  discussed  in  chapter  2,  when  dealing  with  a time— varying  signal 
(e.g.,  speech),  it  is  better  to  use  a VFF  to  enable  the  parameter 
estimate  to  follow  both  slow  and  fast  changes  in  the  signal.  For  a 
locally  stationary  signal  production  process,  the  residual  error  at 
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each  time  k will  indicate  the  state  of  the  estimator.  If  the  error 

signal  is  small,  then  the  forgetting  factor  (FF)  should  be  close  to 
unity;  thus,  the  algorithm  uses  most  of  the  previous  information  in  the 
signal.  If,  on  the  other  hand,  the  error  is  large,  then  a small  FF 
will  decrease  the  weighting  of  the  error.  This  decrease  in  weighting 
of  the  error  signal  will  shorten  the  effective  memory  length  of  the 
estimation  process  until  the  parameters  are  readjusted  and  the  error 
becomes  small.  A procedure  to  achieve  the  proper  error  weighting  by 
choosing  the  appropriate  FF  is  discussed  here.  The  error  information 
of  the  filter  can  be  defined  as  the  weighted  sum  of  the  squares  of 
residual  errors;  this  can  be  expressed  recursively  as  (see  Appendix  B) 

1 Jk  = X 1 Jk-1  + ^k2  d - +k}  Kk>  (3.13) 

A strategy  for  choosing  the  forgetting  factor  A^  may  now  be 
defined  by  requiring  E Ik  to  be  such  that 

1 xk  = 1 Ik-1  -•••  = £!<>  (3.14) 

In  other  words,  the  forgetting  factor  will  compensate  at  each 
step  k for  the  new  error  information  in  the  latest  measurement.  This, 
thereby,  insures  that  the  estimation  is  always  based  on  the  same  error 
information.  Thus  from  (3.13)  by  setting  X = we  have 


\ = 1 ~ ^k2d  - 'f'kt  Kk)  / E I0 


(3.15) 
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Therefore  the  WRLS-VFF  algorithm  can  be  expressed  with  the  same 
equations  used  for  the  VRLS  algorithm  in  (3.12)  while  the  constant 
weighting  factor  X can  be  replaced  by  A^  as  shown  in  (3.15).  The 

effective  memory  of  the  algorithm  can  be  defined  as  [Cowan  and  Grant, 
1985] 

N = 1/d-Xfc)  (3. 16) 

Furthermore,  if  \ becomes  small,  then  the  memory  also  becomes 
small.  But,  practically,  in  some  applications  we  know  we  may  need  a 
certain  memory  size.  In  these  cases,  we  recommend  that  a minimal  A^  be 
defined  as 

Amin  = 1~1/Na*  lf  \ < \nin>  then  \ = \nin  (3.17) 

where  Na=p+q  is  the  total  number  of  the  filter  coefficients  in  (3.1). 
On  the  other  hand,  since  L IQ  is  related  to  the  sum  of  the  squares  of 
the  error,  it  can  be  calculated  from  the  prediction  error  over  one  or 
two  analysis  frames  using  a block  method  before  executing  the  WRLS-VFF 
algori thm. 

Derivation  of  the  WRLS  Algorithm  for  White  Noise  and  Pulse  Input 

In  the  previous  section,  we  assumed  that  the  input  to  the  unknown 
system  is  a zero  mean  white  noise  sequence.  However,  in  some 
applications  other  types  of  input  excitation  exist  (e.g.,  in  a speech 
production  system),  the  input  can  be  either  pulse  trains  for  the  voiced 
sounds  or  white  noise  for  fricative  sounds.  Therefore  a general 
expression  of  the  input  sequence  is  given  as  follows: 
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uk  = ukP  + ukV  (3.18) 

where  ukP  represents  the  pulse  input  and  ukw  is  the  white  noise. 

Now  we  introduce  a weighted  least  squares  criterion  (cost 
function)  to  estimate  the  ARMA  parameters  based  on  the  estimation  error 
ek  (different  from  the  residual  error  rk),  namely,  from  (3.5)  we  have 


ek  = yk  - Yk 

= yk  - ^ 3c  - Uk  (3.19) 

Thus,  the  cost  function  Vk(9)  is  expressed  as 
k 

Vk(0)  = £ X^1  (yi  - 4.^  Qi-  uk)2  (3.20) 

i = l 

From  (3.18)  we  have  £ uk  = £ ukw  + £ ukP,  and  under  the  theorem 
of  ergodicity,  for  large  k,  these  sums  can  be  replaced  by  expectations, 
namely,  £ ukw  = E [ukw]  [Fortescue,  1977].  Miyanaga  et  al.  [1982] 
indicates  that  when  the  input  to  the  reference  model  (unknown  system  in 
Figure  3.1)  is  white,  the  prediction  error  produced  by  an  optimal 
prediction  (e.g. , VRLS  algorithm)  is  also  white.  Thus,  for  a zero  mean 
white  noise  input,  we  have  E[ukv]  = E[ukw]  = 0.  Then  (3.20)  can  be 
rewritten  as 

k 

Vk(9)  = £ X^1  (yi  - ujP)2  (3.21) 

i = l 

The  matrix  representation  of  the  sum  of  the  estimation  error  in  (3.21) 
is  given  as 
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Ek  = Yk  “ V 3c  " (3.22) 

and  hence 

k 

Vk(0)  = E X^-i  (yi  - 0.  _ 0^)2 

i = l 

= I I \l/1  Ek  I I2 

= Ek  \ Ek  (3.23) 

The  vector  that  minimizes  the  cost  function  Vk(0)  can  be  obtained  by 
setting  aVk(9)/30  equal  to  zero  and  by  solving  for  9,  namely, 

3t  = l*k  \ V]-1  \ Yk  - t*k  \ V]-1  *k  \ Ukp  (3.24) 

The  recursive  equation  to  update  the  parameter  estimate  9 is  given  as 
follows  [Appendix  C]: 


= ®k-l  - Kk  (Yk  ~ ‘f’k1  %-l  - ukp)  (3.25) 

Comparing  this  equation  with  (3.12)  we  find  that  the  estimated  input 
pulse  (ukP)  is  subtracted  from  the  new  estimate  (f^.).  Thus  means  that 
the  effect  of  the  periodic  pulse  input  on  the  estimates  can  be  removed 
during  the  process  of  adaptation. 

Input  Estimation 

The  algorithm  for  ARMA  parameter  estimation  in  the  previous 
section  has  been  derived  by  assuming  that  the  input  uk  is  already 
obtained  at  the  time  instant  k.  In  this  section,  different  methods  of 
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estimating  both  the  input  white  noise  and  input  pulse  will  be 
discussed. 

Method  1.  Morikawa  and  Fujisaki  [1982]  and  Friedlander  [1982b] 
used  the  estimated  residual  error  r^  at  the  instant  k as  the  estimated 
input  u^,  namely, 

Qk  = rk  = Yk  - V \ (3.26) 

This  method  is  based  on  the  fact  that  the  driving  source  to  the  ARMA 
process  is  a pure  white  noise.  For  applications  when  the  input 
contains  both  white  noise  and  another  signal  (e.g.,  a pulse),  we  need 
to  estimate  these  signals  to  obtain  more  accurate  estimates. 

Method  2.  Miyanaga  et  al.  [1982]  proposed  a method  to  estimate 
both  white  noise  and  an  input  pulse.  This  method  estimates  the  input 
pulse  by  performing  two  adaptive  algorithms  simultaneously,  each 
placing  different  conditions  on  the  weighting  factor  A.  The  instant  of 
occurrence  of  a pulse  is  determined  as  the  time  when  an  unexpectedly 
large  variation  of  the  prediction  error  power  occurs.  When  the  input 
of  an  ARMA  process  is  white  noise,  the  prediction  error  vj^  produced  by 
an  optimal  prediction  is  also  white  noise.  So  the  input  pulse  train 
and  the  input  white  noise  are  automatically  chosen  by  a threshold  level 
in  the  estimation  algorithm.  Simply  speaking,  algorithm  1 in 
Miyanaga' s method  is  used  to  estimate  the  parameters  and  find  the 
optimal  prediction  error  and  error  covariance  av^;  algorithm  2 is  used 
to  estimate  the  prediction  error  vj^  and  decide  when  the  input  is  a 
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pulse  or  white  noise.  The  estimation  of  a large  variance  ctv^  is  also 
proposed  in  a recent  paper  by  [Miyanaga  et  al.,  1986]. 

Method  3.  The  proposed  input  estimation  method  in  this  study  is 
similar  to  method  2 but  uses  the  forgetting  factor  as  a reference  to 
examine  the  input  condition.  This  can  reduce  the  system  complexity 
since  only  one  adaptive  algorithm  is  used  instead  of  using  the  two 
adaptive  processes  as  in  Miyanaga  et  al.  [1982].  Moreover,  the 
forgetting  factor  X can  be  obtained  from  the  adaptive  process;  no  extra 
calculations  are  required. 

From  (3.15),  it  has  been  shown  that  an  increase  in  the  prediction 
error  results  in  a decrease  in  X^.  A small  value  of  X^  indicates  that 
the  input  has  an  abrupt  change  (e.g.,  pulse).  Hence,  we  can  determine 
when  the  input  is  a pulse  by  comparing  the  X^  with  a threshold  value 
V The  magnitude  of  the  pulse  can  be  approximately  given  by  the 
prediction  error  at  the  estimated  time  of  the  input  pulse  [Miyanaga 
et  al.,  1982].  For  white  noise  input,  the  X^  is  close  to  unity  after 
converge  [Fortescue,  1981].  Under  this  condition  the  residual  error  rk 
of  the  adaptive  process  can  be  used  as  the  estimate  of  the  white  noise 
input  tikw  as  indicated  in  method  1.  The  residual  error  power  can  be 
expressed  as  (B12) 

rk2  = (Ukw)2 


- 

= ?k2  (1  - ^ Kfc) 


(3.27) 
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The  procedure  to  estimate  different  input  excitations  is  given  as 
follows : 

Estimated  input: 

uk  = ukv  + ukP 
Forgetting  factor  update: 

^k2  = yk  - V Qk-l 

\ = i - (1  - ^ Kk)/n0 

Decision: 

a)  Pulse  input 

If  < Xq  then 
ukw  = 0 
uk  = ukP 

= ^ - h.  ®k-i 

b)  Vhite  noise  input 

If  ^ > Xq  then 
ukP  = 0 

°k  = akV 

= ^k  d - *k  Kk>1/2  (3.28) 

WRLS-VFF  Algorithm  with  the  Input  Estimation 
The  overall  WRLS-VFF  algorithm  with  the  input  estimation  is 
summarized  as  follows: 

Prediction  error:  = yk  - <f>kt:  ®k-l 

Gain  update:  Kk  = Pk_!  ^ [ Xk_1  + pk_1  ^j-1 

Forgetting:  \ = 1 - ^ (1  - Kk>  / no 


40 


Input  estimate: 

a)  Pulse  input 

If  < Xq  then 

%w  = o 

uk  = ukP 

= yk  - +k  ®k-l 

b)  White  noise  input 
If  Xk  > Xq  then 
ukP  = 0 

uk  = ukw 

= d - Kk>1/2 

Parameter  update:  ^ = 0|c_^  + Kk  (yk  - 4^ t - ukP) 

Covariance  matrix:  Pk  = Xk~ ^ [Pk_j  + Kk  Pjc_i  1 (3.29) 

Order  Selection  of  the  ARMA  Model 
In  the  previous  section  we  assumed  that  the  order  of  the  ARMA 
process  was  predetermined.  We  discuss  some  properties  of  the  model 
order  and  a practical  approach  to  select  the  order  of  an  ARMA  process 
in  this  section. 

Effect  of  Model  Order  on  Parameter  Estimation 

One  of  the  important  decisions  that  usually  has  to  be  made  a 
priori  in  the  fitting  of  pole-zero  models  is  the  determination  of  an 
"optimal"  number  of  poles  and  zeros.  It  has  been  observed  that  when 
the  order  of  the  poles  is  selected  too  low,  there  will  be  generally  too 
few  model  poles  to  adequately  represent  the  underlying  spectrum.  On 
the  other  hand,  too  high  of  a choice  for  pole  order  will  typically 
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result  in  spurious  effects  (e.g.,  false  peaks)  in  the  estimated 
spectrum.  Similarly,  an  inadequate  selection  of  the  order  of  the  zeros 
will  result  in  inaccurate  estimates  of  the  formant  and  antiformant 
frequencies  and  bandwidths.  With  these  thoughts  in  mind,  investigators 
have  proposed  various  order  selection  procedures  such  as  Akiake's 
information  criterion  (AIC)  [Akaike,  1974],  Rissanen's  information 
criterion  (RIS)  [Rissanen,  1978]  and  relative  variation  error  analysis 
[Vaz  et  al.,  1978]. 

Although  these  procedures  typically  work  well,  they  can  yield 
unsatisfactory  performance  in  some  cases  [Kashyap,  1980].  The  user  is, 
therefore,  cautioned  to  use  discretion  in  applying  the  above  and  other 
model  order  determination  procedures.  The  method  to  be  ultimately  used 
should  be  determined  through  the  above  criteria  and  some  empirical 
experimentation  based  on  the  time  series  related  to  the  specific 
application  under  consideration. 

A Practical  Approach  to  Determine  the  Order  of  an  ARMA  Process 

A practical  approach  to  select  the  ARMA  model  order,  called 
minimal  spectral  distance  error  method,  is  proposed  in  this  study. 
This  method  determines  the  order  of  the  poles  (p)  and  zeros  (q) 
separately  rather  than  simultaneously.  The  search  for  the  optimal 
values  of  p and  q is  conducted  for  the  first  frame  of  N samples  before 
executing  any  parameter  estimation  process.  The  optimal  order  of  the 
poles  is  first  determined  by  either  using  the  AIC  method  or  using  the 
RIS  method. 

The  AIC  method  is  based  on  a maximum  likelihood  estimation  of  the 
probability  density  function  of  the  signal.  The  best  order  is  the  one 
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that  minimizes  a measure  of  the  information,  which  for  a Gaussian 
distribution  and  an  AR  model  of  order  p,  is  given  by 

AIC(p)  = log  V(p)  + 2p/Ne 

V(p)  = Ep/R(0)  (3.30) 

where  Ne  is  the  "effective"  number  of  data  points  in  the  signal  record, 
V(p)  is  the  normalized  error,  Ep  is  the  minimized  total  squared  error 
(residual  energy)  and  R(0)  is  the  estimated  autocorrelation  of  the 
signal  at  time  lag  zero. 

The  formula  used  for  the  RIS  method,  which  is  quite  similar  to 
AIC , is  given  by 

RIS(p)  = log  V(p)  + (p/N)  log  N (3.31) 

After  determining  the  optimal  order  of  the  poles  (p)  for  the  ARMA 
process,  the  order  of  the  zeros  (q)  is  first  set  to  zero  and  then 
increased  by  one  at  each  step  when  executing  the  ARMA  parameter 
estimation  process  with  any  algorithm,  such  as  least  squares  modified 
Yule  Walker  equation  (LSWYWE)  or  VRLS-VFF.  At  each  step,  the  log 
spectral  distance  between  the  smoothed  spectrum  and  the  spectrum 
obtained  from  the  ARMA  model  is  calculated.  The  optimal  order  of  the 
zeros,  q,  is  found  when  the  spectral  distance  is  a minimal  value.  The 
procedure  of  the  minimal  spectral  distance  method  is  summarized  as 
follows : 

(1)  Use  the  first  frame  of  N data  samples  to  estimate  the  model 


order,  p. 
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(2)  Find  the  optimal  order  (p)  of  poles  in  the  ARMA  process  based 

on  the  AIC  or  RIS  method  by  using  the  autocorrelation  method  [Makoul 

and  Gray  1976]. 

(3)  Find  the  ARMA  (p,q)  spectrum  by  using  LSMYVE  method  [Cadzow, 
1982]  with  fixed  p (obtained  from  the  previous  step)  and  variable  q. 

(4)  Find  the  smoothed  spectrum  by  using  the  cepstral  method 
[Rabiner  and  Schafer,  1975]. 

(5)  Find  the  optimal  order  (q)  of  zeros  with  the  minimal  spectral 

distance  error  by  comparing  the  spectral  distance  between  the  smoothed 

spectrum  and  the  ARMA  (p,q)  spectrum. 

Convergence  Analysis  of  the  WRLS-VFF  Algorithm 

The  factors  that  effect  the  convergence  of  the  VRLS-VFF  algorithm 

are 

(1)  unsuitable  model  order, 

(2)  the  degree  of  stationarity  of  the  underlying  signal,  and 

(3)  the  location  and  length  of  the  analysis  interval. 

Assuming  the  proper  model  order  can  be  determined  a priori  and 
that  the  analysis  data  interval  is  large  enough  (this  can  be  done  by 
increasing  the  sampling  frequency),  the  condition  of  convergence  for 
the  the  WRLS-VFF  algorithm  can  be  examined  from  (3.27),  namely, 


rk2  = V*  (1  - V Kk) 

= *4t2  U - *k  Pk-1  *k  / (X  + 'hct  pk-l  *k>] 
= *k2  (1  + X_1  ♦kt  pk-l  'f’k)"1 


(3.32) 
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Then  we  have 

= rk2  (1  + X-1  ^ ?k_!  4^)  (3.33) 

From  the  above  equation  we  find  that  if  1)  the  covariance  matrix  Pk_1 
is  positive  definite,  2)  X is  close  to  one,  and  3)  ^ Pk_1  ^ 

converges  to  zero,  then  the  variance  of  the  prediction  error  ( 
converges  to  the  variance  of  the  residual  error  (rk).  This  is  because 
?k2  = rk2  when  k becomes  infinite,  and  Eff^2]  = E[rk2]  = E[ukw  ukw]  = 

A o o 

CTu,k  > where  crUjkz  is  the  variance  of  the  input  white  noise. 

It  can  be  shown  that  if  the  input  to  a stationary  ARMA  process  is 
white  noise  with  zero  mean  and  variance  ctu2,  the  covariance  matrix  used 
in  the  least  squares  method  is  positive  definite,  since  the  estimated 
input  white  noise  ukw  is  uncorrelated  with  the  measured  signal  yk,  E[yk 
^kWJ=0  [Miyanaga  et  al.,  1982].  The  eigenvalue  of  Pk  converges  to 
zero  with  k -»  °°,  [Makoul,  1976].  Thus, 

lim  Pk  = 0 I,  and  X"1  ^ PR_1  ^ = 0 (3.34) 

k-x» 

where  I is  the  identity  matrix.  The  forgetting  factor  (X)  is  generally 
selected  close  to  one.  Therefore,  the  VRLS  algorithm  converges  to  the 
variance  of  the  residual  error  for  the  stationary  ARMA  process.  But 
for  some  applications  (e.g.,  speech  signal  analysis)  we  cannot 
explicitly  define  the  condition  for  which  the  covariance  matrix  Pk  is 
positive  definite.  This  is  because  the  source  of  the  input  excitation 
is  unknown  (could  either  be  white  noise  or  a pulse)  or  the  process  to 
generate  the  speech  signal  is  nonstationary.  Under  this  situation,  a 
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small  X is  used  to  improve  the  convergence  rate  and  some  other  method 
such  as  closed  phase  analysis  [Krishnamur thy  and  Childers,  1986]  is 
used  to  ensure  that  the  analysis  interval  is  a pure  AR  or  ARMA  process. 

Discussion 

The  main  contributions  of  this  chapter  are  three  fold:  1)  a 
recursive  ARMA  parameter  estimation  algorithm  based  on  the  URLS 
algorithm  with  a variable  forgetting  factor  was  derived,  2)  an  input 
estimation  method  using  the  VFF  was  proposed,  and  3)  a practical  method 
to  determine  the  model  order  based  on  the  spectral  distance  error 
between  the  log  magnitude  spectrum  and  the  estimated  ARMA  spectrum  was 
introduced. 


CHAPTER  4 


IMPLEMENTATION  AND  PERFORMANCE  EVALUATION  OF  THE  WRLS-VFF  ALGORITHM 

Introduction 

The  WRLS-VFF  algorithm  was  implemented  and  experimentally  compared 
to  the  block  data  processing  algorithms  and  other  recursive  algorithms. 
The  performances  of  these  algorithms  were  evaluated  on  synthetic  speech 
signals,  since  we  could  easily  generate  such  signals  with  a formant 
synthesizer  [Klatt,  1980],  while  controlling  selected  parameters,  such 
the  formants,  their  bandwidths,  and  the  synthesizer's  excitation  signal. 
The  results  included  error  analysis,  formant  tracking  contours,  and 
three-dimensional  (3D)  spectral  plots. 

Implementation 

The  implementation  flow  chart  of  the  WRLS-VFF  algorithm  is  shown 
in  Figure  4.1.  The  implementation  procedure  contains  three  steps:  1) 
initialization,  2)  execution  of  the  recursive  equations  for  the  WRLS-VFF 
algorithm,  and  3)  storage  of  the  parameter  estimates  for  output. 

Initialization 

The  selection  of  the  initial  values  of  the  covariance  matrix  PQ, 
the  parameter  estimates  90,  the  error  information  IQ,  the  forgetting 
factor  and  the  threshold  value  Xq  is  described  as  follows: 
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Figure  4.1  Implementation  flow  chart  of  the  WRLS-VFF  algorithm. 
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(a)  The  initial  value  of  PQ  may  be  obtained  simply  by  letting  PQ 
= crl,  where  u is  a large  (positive)  number  [Cowan  and  Grant,  1985]. 
Usually,  small  values  of  P0  will  slow  down  the  rate  of  convergence. 
But  experimental  results  show  that  there  is  no  noticeable  difference  in 
the  rate  of  convergence  between  the  cases  of  Po=100I  and  Po=1000I. 
This  means  that  the  convergence  properties  are  not  affected 
significantly  by  PQ,  as  long  as  it  is  suitably  large  compared  to  the 
variance  of  the  source  signal  [Morikawa  and  Fujisaki,  1982].  This  fact 
is  very  important  for  signals  with  a wide  dynamic  range  (e.g.,  speech). 
In  this  study  Po=100I  was  used  to  analyze  both  synthetic  and  real 
speech. 

(b)  The  initial  estimate  of  the  ARMA  parameter  vector  90  can  be 
set  to  zero.  One  cost  of  this  selection  is  that  more  samples  are 
required  to  obtain  an  accurate  estimate  of  the  parameters  at  the 
beginning  of  the  adaptive  process. 

(c)  The  error  information  EI0  (the  sum  of  the  residual  energy) 
can  be  calculated  from  the  prediction  error  over  one  or  two  analysis 
frames  using  the  linear  predictive  coding  (LPC)  method.  In  this  study 
10^  was  used  for  EI0. 

(d)  The  parameter  determines  the  minimum  effective  memory 
required  in  the  adaptive  process.  This  value  can  be  obtained  from 
(3.17). 

(e)  The  parameter  Ag  is  the  threshold  value  used  for  detecting 
different  input  signals.  It  can  be  determined  experimentally  (e.g., 
Aq=0.1  was  used  for  the  synthetic  speech  used  in  this  chapter). 
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Execution 

The  algorithm  skips  p samples  (pole  order)  at  the  beginning  of  an 
analysis  interval  in  order  to  fill  the  elements  of  data  vector  (j^,  then 
we  set  <|>[^  = (fp  = [ Yl  > • • » Yp > uj  > • • ? dq  ] = [yj  , • • >yp>0>  • • >0]  • Starting 
from  k=p+l,  the  estimate  9^  of  0^  is  updated  by  (3.29)  for  each  instant 
k. 


Parameter  Output 

The  parameters  are  updated  at  every  sample  during  the  adaptive 
process.  The  final  estimates  of  the  ARMA  parameters  are  stored 

according  to  either  a predetermined  range  of  analysis  intervals  or  a 
checking  criteria,  i.e., 

(1)  The  estimation  error  power  e^2  is  below  a pre-specif ied 

value. 

(2)  The  norm  of  the  parameter  estimate  error  is  below  a threshold 
value  [Morikawa  and  Fujisaki,  1982]. 

The  selection  of  the  storing  interval  for  analysis  usually 
depends  on  the  underlying  signals  and  empirical  information.  For 
example,  in  the  speech  signal  analysis,  the  analysis  interval  can  be 
set  within  one  pitch  period  (also  called  pitch-synchronized  analysis) 
[Rabiner  and  Schafer,  1975]  or  during  the  closed  phase  interval 
[Krishnamurthy  and  Childers,  1986].  The  latter  method  will  be 
discussed  in  the  next  chapter. 

Generation  of  Synthetic  Speech  Signals 
Digital  Resonator 

The  generation  of  synthetic  speech  signals  is  based  on  a digital 
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formant  synthesizer.  The  basic  building  block  of  the  synthesizer  is  a 
digital  resonator.  Two  parameters  are  used  to  specify  the  input-output 
characteristics  of  a resonator,  the  resonant  (formant)  frequency  F,  and 
the  resonance  bandwidth  B.  The  transfer  function  of  a digital 
resonator  can  be  defined  using  the  z-transform  as  [Gold  and  Rabiner, 
1968]: 


Ak(z) 


1 - 2rpk.cos(6plc)  + rpk2 
1 " 2rpkcos(epk)z_1  + rpk2z'2 


(4.1) 


where  rpk  = exp(-rtBpjc/Fs) , 9pk  = 2rtFpk/Fs,  Fs  is  the  sampling 
frequency,  and  Fpk  and  Bpk  are  the  frequency  and  bandwidth, 

respectively,  of  the  k1*1  resonator. 

An  antiresonator  (also  called  an  antiformant)  is  the  mirror  image 
of  the  resonator.  The  transfer  function  of  an  antiresonator  can  be 
expressed  as 


1 - 2rzkcos(9zlc)z-1  + rzk2z~2 

Bk(z)  = (4.2) 

1 - 2rzkcos(9zk)  + rzk2 


where  rzk  = exp(-rcBzk/Fs)  and  0zk  = 2rtFzk/Fs,  and  Fzk  and  Bzk  are  the 
frequency  and  bandwidth,  respectively,  of  the  kth  antiresonator. 

In  order  to  generate  the  synthetic  speech  signal  with  multiple 
resonance  frequencies,  the  digital  resonators  are  connected  in  cascade. 
The  transfer  function  of  the  cascade  resonator  is  given  as 


n 

H(z)  = n Ak(z)  (4.3) 

k=l 


where  n is  the  number  of  resonance  frequencies.  When  the  synthetic 
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speech  signal  consists  of  both  resonance  and  antiresonance  (transfer- 
function  zero  pair)  frequencies,  the  combined  transfer  function  H(z) 
can  be  expressed  as 

n m 

H(z)  = n Ak(z)  n Bk(z)  (4.4) 
k=l  k=l 

where  m is  the  number  of  antiresonators.  The  impulse  response  h(n)  of 
the  cascade  resonators  can  be  obtained  by  taking  the  inverse  fast 
Fourier  transform  (IFFT)  of  H(z).  The  synthetic  speech  signal  was 
generated  by  convolving  the  resonator  impulse  response  h(n)  with  the 
input  excitation  u(n).  In  this  study,  either  an  impulse  train  or  the 
Fant  model  [Fant,  1979]  was  used  for  u(n),  depending  on  the  signal  to 
be  generated. 

Signal  Generation 

The  formant  and  bandwidth  used  to  generate  synthetic  speech 
signals  are  listed  in  Table  1 [Klatt,  1980].  The  input  to  the 
resonator  is  a pulse  train  with  a period  (T0)  of  10  ms.  The  waveforms 
and  spectra  of  synthetic  speech  signals  SYN1  (/i/),  SYN2  (/a/)  and  SYN3 
(/m/)  are  shown  in  Figures  4.2,  4.3,  and  4.4,  respectively.  In  order 
to  examine  the  performance  of  each  adaptive  algorithm,  a simulated 
diphthong  /a/-/i/  and  nasal  sound  /m/-/i/  were  generated.  The  block 
diagram  to  generate  these  signals  is  shown  in  Figure  4.5.  For  the 
generation  of  the  diphthong  (SYN4),  h^(n)  and  h2(n)  represent  the 
impulse  response  of  resonators  / i / and  /a/,  respectively.  For  the 
generation  of  the  nasal  sound  (SYN5),  h^(n)  and  h2(n)  represent  the 
impulse  response  of  the  resonators  /m/  and  / i / , respectively.  For  the 
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Table  4.1  Formant  (F)  and  bandwidth  (B)  used  for  generating 

synthetic  speech  signals  (all  frequencies  are  in  Hz). 


signal 

FI 

F2 

F3 

B1 

B2 

B3 

FZ1 

BZ1 

/m/ 

390 

1250 

2150 

60 

150 

200 

780 

80 

/i/ 

270 

2290 

3010 

75 

95 

120 

- 

- 

/a/ 

730 

1090 

2240 

60 

85 

110 

- 

- 
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diphthong,  the  vowel  / i / was  converted  to  /a/  using  a linear  formant 
variation  as  shown  in  Figure  4.6.  The  reason  for  using  a discrete 
number  to  represent  the  formants  in  this  figure  is  that  these  formants 
were  obtained  sequentially  from  each  analysis  frame.  The  same  approach 
was  used  for  the  nasal  sound  as  shown  in  Figure  4.7.  The  transition 
between  vowels  or  between  nasal  and  vowel  is  accomplished  by  letting 

the  gain  factor  "a"  (shown  in  Figure  4.5)  vary  linearly  from  1.0  to  0.0 

over  the  interval  from  sample  301  to  700.  The  generation  procedures 
are  illustrated  by  the  following  equation: 

s(n)  = a sj(n)  + (1-a)  S2(n) 

n = 1 to  300,  a = 1 

n = 301  to  700  a = 1 - (n-300)/400 

n = 701  to  1000  a = 0 (4.5) 

where  n is  the  data  point  and  s^(n)  and  S2(n)  are  the  output  of  the 
resonators.  The  sampling  frequency  was  10  KHz,  and  a total  of  1000 
sample  data  points  was  generated. 

Performance  Evaluation 

The  performance  of  the  VRLS-VFF  algorithm  was  based  on  the 
following  items: 

1.  Pole-zero  estimation. 

2.  Formant/Antiformant  tracking. 

3.  Input  cancellation. 

4.  Effect  of  noise  on  spectral  estimation. 

5.  Complexity  analysis  of  the  WRLS-VFF  algorithm. 
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(b) 

Figure  4.2  Synthetic  speech  signal  SYN1  /i/i  (a)  waveform 
and  (b)  log  magnitude  spectrum. 
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(a) 


(b) 

Figure  4.3  Synthetic  speech  signal  SYN2  /a/:  (a)  waveform 
and  (b)  log  magnitude  spectrum. 
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(a) 


(b) 

Figure  4.4  Synthetic  speech  signal  SYN3  /m/:  (a)  waveform 
and  (b)  log  magnitude  spectrum. 
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Figure  4.5  Block  diagram  for  generating  synthetic  speech  signals 
(1)  diththong  /a/-/i/  and  (2)  nasal  /m/-/i/. 
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Figure  4.6  Synthetic  speech  signal  SYN4  /i/-/a/:  (a)  waveform 
and  (b)  formant  tracks. 
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(a) 


(b) 

Figure  4.7  Synthetic  speech  signal  SYN5  /m/-/i/:  (a)  waveform 
and  (b)  formant  tracks. 


60 


Pole-Zero  Estimation 

The  pole-zero  spectral  analysis  techniques  can  be  categorized,  as 
mentioned  in  chapter  two,  into  three  groups:  1)  iterative  methods,  2) 
two-stage  least  squares  modified  (also  called  over-determined)  Yule- 
Walker  equation  (LSMYWE)  methods,  and  3)  recursive  least  squares  (RLS) 
methods.  Since  the  results  of  a spectral  analysis  technique  are  data 
dependent,  one  approach  may  yield  poor  results  for  one  data  set  but  be 
very  good  for  another.  Further,  the  "results"  desired  may  vary  from 
user  to  user.  One  analyst  may  desire  high  resolution,  while  another  may 
desire  robust  performance  for  various  signal-to-noise  ratios. 
Therefore,  the  comparison  of  different  ARMA  spectral  analysis  techniques 
cannot  be  based  only  on  the  error  criterion  used  in  the  development  of 
these  algorithms,  but  other  criteria  might  be  considered  as  well  in  real 
applications,  i.e.,  1)  real  time  computation,  2)  automatic  updating,  3) 
spectral  matching,  or  4)  minimal  bias  and  variance.  For  speech  signal 
analysis,  we  considered  both  the  estimation  spectral  peaks  (formants) 
and  null  (antiformants)  to  be  necessary,  since  we  would  achieve  better 
synthesis  of  particular  utterances  [Atal,  1978;  Childers  et  al,  1981]. 
Moreover,  an  accurate  spectral  matching  of  the  original  speech 
production  model  is  also  essential  for  the  development  of  speech  and 
speaker  recognition  systems.  In  this  section,  a synthetic  speech  signal 
/m/  was  analyzed  and  compared  with  several  algorithms  according  to  the 
following  criteria: 

(1)  Formant/antiformant  estimation. 

(2)  Spectral  matching  (root  mean  square  spectral  error). 

For  comparison  purposes  we  selected  a representative  algorithm  from  the 
iterative,  LSMYWE,  and  RLS  methods.  For  example,  the  iterative  inverse 
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filtering  (ITIF)  method  [Konvalinka  and  Matausek,  1979 J,  two-stage 
LSMYVE  method  [Cadzow,  1982;  Kay,  1987],  LPC  method  [Makhoul,  1976],  as 
well  as  the  proposed  WRLS-VFF  algorithm  were  examined  in  this  study. 

In  order  to  quantitatively  evaluate  the  estimation  error  of  pole 
and  zero  frequencies,  the  following  criteria  were  used: 

m 

Ep  = E (AF pi/F pi)  (4.6) 

i=l 

and 

n 

E = E (AFzi/Fzi)  (4.7) 

i = l 

where  m and  n are  the  number  of  the  poles  and  zeros,  respectively,  of 
the  ARMA  model.  Fp^  is  the  pole  frequency,  AFp^  is  the  estimation  error 
of  its  frequency,  Fz^  is  the  zero  frequency,  and  AFZ^  is  the  estimation 
error  of  its  frequency.  The  pole  and  zero  frequencies  are  determined 
by  solving  for  the  roots  of  the  polynomial  whose  coefficients  are  the 
AR  parameters  and  MA  parameters,  respectively. 

The  magnitude-squared  spectra  of  the  AR  and  ARMA  model  can  be 
obtained  by  the  following  equations  [Makhoul,  1976]: 

AR  model: 


G2 


|1 


P 

+ E 
k=l 


ak 


e-jak 


|H(co)  |2  = 


(4.8) 
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ARMA  model : 


|H(co)  |2  = 


Q • i. 

+ Z e 
k=l 


2 


I 1 


P 

+ 2 ak 
k-1 


-jok  |2 


(4.9) 


There  are  a number  of  types  of  error  criteria  to  evaluate  how 
close  the  estimated  spectrum  |H(w)  |2  is  to  the  original  spectrum 
|H(v)  |2,  where  |H(w)  |2  and  |H(w)  |2  are  the  magnitude  squared  of  the 
reference  and  estimated  spectra,  respectively.  In  discrete  form,  the 
root  mean  squares  (RMS)  spectral  error  is  calculated  by  [Schmid,  1984; 
Pinto,  1987] 


N-l 

E(DB)  = sqr t { (1/N)  L [10  log  ^(wj)  |2  - 10  log  ^(wi ) |2 ] 2) 

i=o  (4.10) 

In  many  applications  (e.g.,  speech  signal  analysis),  we  are  interested 
mainly  in  the  spectral  errors  for  the  peak  and  valley  regions  of  the 
signal  spectrum.  Therefore,  a weighting  on  the  spectral  error  over  the 
frequency  range  of  interest  is  used  for  the  analysis.  Ve  are  also 

interested  in  matching  the  gain  factor  for  the  estimated  and  original 
spectra  for  comparison  reasons.  This  is  achieved  by  adding  an  offset 
gain  to  each  estimated  spectrum  in  the  RMS  spectral  error  analysis. 

Figure  4.8  shows  the  AR  (all-pole)  spectrum  for  the  synthetic 
signal  SYN3  using  the  LPC  covariance  method  [Makhoul,  1976].  The  model 
order  was  selected  as  IP=8  and  IP=10.  In  principle,  the  spectral 

properties  of  a zero  can  be  approximated  with  arbitrary  precision  by 

additional  poles  in  the  all-pole  model  [Atal  and  Hanauer,  1971].  Such 
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an  approximation,  however,  is  likely  to  be  inefficient  and  impractical. 
Furthermore,  it  is  difficult  to  produce  a local  dip  in  the  spectrum  by 
using  additional  poles  without  introducing  ripples  at  other  frequencies 
in  the  spectrum,  and  a large  RMS  spectral  error  is  also  obtained  as 
shown  in  Table  4.2B. 

Figures  4.9,  4.10,  and  4.11  are  the  estimated  ARMA  spectra  for 
the  synthetic  speech  signal  SYN3  using  the  ITIF,  LSMYVE,  and  VRLS-VFF 
methods,  respectively.  Comparing  the  experimental  results  in  Table 
4.2A  and  4.2B  shows  that  the  ITIF  and  LSMYVE  methods  give  reasonable 
estimates  of  the  formants  and  bandwidth,  but  due  to  the  block 
processing  approach  used  in  these  two  methods,  the  periodic  excitation 
pulses  contained  in  each  block  affects  the  accuracy  of  the  estimated 
formant  and  bandwidth.  However,  the  VRLS-VFF  method,  which  eliminates 
the  influence  of  the  pulse  excitation  interval  (pitch  period)  by  using 
an  input  estimation  approach,  gives  very  accurate  formant/bandwidth 
estimates  and  good  spectral  matching  (small  RMS  spectral  error). 

The  same  test  signal  SYN3  was  used  to  examine  the  performance  of 
different  recursive  ARMA  parameter  estimation  algorithms,  i.e.,  1) 
sequential  estimation  ARMA  (SEARMA)  [Morikawa  and  Fujisaki,  1982], 
which  is  similar  to  a Kalman  filter  algorithm  [Gibson,  et.  al.,  1975], 
2)  WRLS  [ Friedlander , 1982],  3)  modified  VRLS  [Fortescue  et  al,  1981], 
and  4)  the  proposed  VRLS-VFF  method.  The  test  conditions  (forgetting 
factor  A and  input  estimation  method)  and  the  results  for  the  synthetic 
speech  signal  SYN3  using  these  methods  are  shown  in  Table  4.3A,  which 
shows  that  the  VRLS-VFF  method  gives  the  minimal  estimation  error  in 
terms  of  the  formant  and  antiformant  frequencies.  The  reason  is  that 
the  VRLS-VFF  method,  which  uses  the  variable  A and  the  proposed  input 
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Table  4.2A  Formant  (F)  and  handwidth  (B)  estimation  results  for 
the  synthetic  speech  signal  SYN3. 


Formant  freq. 
and  BV  (Hz) 

LPC 

IP=8 

LPC 

IP=10 

ITIF 

IP=6  IQ=2 

LSMYWE 
IP=6  IQ=2 

WRLS-VFF 
IP=6  IQ=2 

Fl/Bl 

390/60 

394/198 

379/64 

396/127 

381/42 

390/60 

F2/B2 

1250/150 

1435/300 

1333/138 

1286/267 

1291/173 

1261/199 

F3/B3 

2150/200 

2201/41 

2126/128 

2155/215 

2145/145 

2151/200 

Fz/Bz 

780/80 

N/A 

N/A 

740/381 

784/144 

776/144 

N/A  - Not  Applicable 


Table  4.2B  Estimation  error  analysis  for  the  different  spectral 
analysis  methods  for  the  synthetic  speech  signal  SYN3. 


Error 

Analysis 

LPC 

IP=8 

LPC 

IP=10 

ITIF 

IP=6  IQ=2 

LSMYWE 
IP=6  IQ=2 

WRLS-VFF 
IP=6  IQ=2 

Total  Formant 
Estimation 
Error  Ep  (X) 

18.0 

10.5 

4.65 

5.82 

.926 

Antiformant 
Estimation 
Error  Ez  (%) 

N/A 

N/A 

5.12 

.512 

.515 

Spectral 
Distance 
Error  (DB) 

4.67 

3.39 

2.11 

1.13 

.708 

N/A  - Not  Applicable 
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Figure  4.8  Log  magnitude  spectrum  for  the  synthetic  signal  SYN3 

using  the  LPC  covariance  method:  (1)  estimated  spectrum 
for  IP=12  (upper),  (2)  estimated  spectrum  for  IP=8 
(middle),  and  (3)  original  spectrum  (lower). 


Figure  4.9  Log  magnitude  spectrum  for  the  synthetic  signal  SYN3 
using  the  ITIF  method:  (1)  original  spectrum  (upper) 
and  (2)  estimated  spectrum  (lower). 
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Figure  4.10  Log  magnitude  spectrum  for  the  synthetic  signal  SYN3 

using  the  LSMYWE  method:  (1)  original  spectrum  (upper) 
and  (2)  estimated  spectrum  (lower). 


Figure  4.11  Log  magnitude  spectrum  for  the  synthetic  signal  SYN3 
using  the  VRLS-VFF  method:  (1)  original  spectrum 
(upper)  and  (2)  estimated  spectrum  (lower). 
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estimation  method,  can  reduce  the  estimation  error  when  the  signal 
changes  abruptly  at  the  pitch  excitation  position.  The  RMS  spectral 
error  of  the  WRLS-VFF  method  is  a little  larger  than  that  of  the  SEARMA 
and  Modified  WRLS  methods.  The  reason  is  that  the  estimated  bandwidth 
of  the  second  formant  is  a little  wider.  However,  this  is  less 
important  since  the  formant  and  antiformant  frequencies  are  the  key 
factors  in  generating  the  synthetic  speech  signal  [Flanagan,  1972]. 

Formant/Antiformant  Tracking 

The  ability  to  track  a spectral  resonance  in  time-varying  signals 
has  many  applications.  In  speech  signal  processing,  accurately 
tracking  speech  parameters  such  as  vocal  tract  resonance  frequencies 
(formants)  and  their  bandwidths  is  necessary  in  many  speech  analysis/ 
synthesis  applications.  In  such  cases,  a good  formant  tracker  can 
often  give  better  information  about  consonant-vowel  transitions  and 
formant  movements  in  the  production  of  diphthongs.  Several  effects 
limit  the  accuracy  with  which  formant  frequencies  and  bandwidths  can  be 
estimated  from  the  spectrum  of  an  actual  speech  signal.  We  will 
discuss  these  effects  when  we  apply  the  WRLS-VFF  algorithm  to  real 
speech  analysis  in  the  next  chapter.  In  this  section,  two  time-varying 
synthetic  signals  SYN4  and  SYN5 , were  used  to  test  the  formant  tracking 
ability  of  each  recursive  algorithm.  Since  the  exact  formant  contours 
of  the  test  signals  are  known,  the  performance  of  each  algorithm  can  be 
easily  verified  by  comparing  the  test  results  with  the  original  formant 
tracks . 

Figures  4.12  (a)  and  (b)  show  the  original  formant  tracks  and  the 
three-dimensional  (3D)  log-magnitude  spectrum  for  the  synthetic  speech 
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signal  SYN4.  This  signal  contains  three  formants  and,  therefore,  a 6th 
order  AR  model  was  used  for  each  recursive  algorithm. 

Figures  4.13  to  4.16  show  the  formant  tracks  and  the  log- 
magnitude  spectra  for  the  synthetic  speech  signal  SYN4  using  the 
SEARMA,  URLS,  Weighted  Least  Squares  Lattice  (ULSL)  [Lee  et  al,  1981], 
and  WRLS-VFF  algorithms,  respectively.  The  3D  spectrum  of  the  SEARMA 
algorithm  (Figure  4.13(b))  shows  a smoothed  out  spectrum  after  the 
transition  region  (i.e.,  between  /i/  and  /a/).  This  is  due  to  the 
effect  of  the  estimator's  "infinite  memory"  (\=1)  which  obscures  the 
changing  signal  characteristics.  Although  in  Figure  4.13(a)  the 
formants  can  be  detected,  the  value  of  each  formant  and  bandwidth  are 
quite  different  from  the  original  ones.  For  example,  the  estimated 
formants  and  bandwidths  are 

Fl=831Hz , Bl=294Hz , F2=1124Hz,  B2=965Hz,  F3=2728Hz,  B3=275Hz, 
while  the  original  formants  and  bandwodths  are 

Fl=730Hz , Bl=60Hz , F2=1090Hz,  B2=85Hz,  F3=2240Hz,  B3=110Hz. 

In  contrast,  each  of  the  "weighted"  recursive  algorithms  (i,e.,  URLS, 
ULSL  and  WRLS-VFF),  detect  the  transition  region.  However,  the 
constant  weighting  of  the  WRLS  and  ULSL  algorithms  gives  a larger 
formant/bandwidth  estimation  error  after  the  transition  than  that 
obtained  in  the  WRLS-VFF  algorithm.  This  is  because  the  weighting 
factor  controls  the  effective  data  window  used  for  the  adaptive  process 
[Cowan  and  Grant,  1985].  A small  data  window  in  the  nonstationary 
transition  region  is  desirable  in  order  to  decrease  the  weighting  of 
the  error,  on  the  other  hand,  after  the  transition  interval,  a larger 
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data  window  is  desirable  since  the  signal  is  stationary.  This  can  only 
be  done  with  the  VRLS-VFF  algorithm,  whereby  the  VFF  varies  according 
to  signal  conditions. 

Table  4.3B  shows  that  the  average  RMS  spectral  error  (total  RMS 
spectral  error  divided  by  the  total  analysis  frames)  obtained  from  the 
WRLS-VFF  method  of  the  synthetic  speech  signal  SYN4  is  less  than  that 
obtained  from  other  resursive  methods. 

Figures  4.17  and  4.18  show  the  time-varying  spectral  plots  for 
the  synthetic  speech  signal  SYN5  (/m/-/i/>  using  the  SEARMA,  VRLS, 
VRLS-VFF  (method  1 input  estimation)  and  VRLS-VFF  (method  2 input 
estimation)  algorithms,  respectively.  These  spectra  were  estimated 
using  an  ARMA  (6,2)  model  and  plotted  for  every  55  samples.  The  reason 
this  updating  interval  was  selected  was  to  prevent  any  plotted  value 
from  appearing  at  the  occurrence  of  an  excitation  (pitch)  pulse  during 
the  analysis.  The  excitation  pulses  were  located  at  100  samples  for 
2000  data  points. 

From  Figure  4.17  (a),  we  found  that  the  SEARMA  algorithm  can 
track  the  spectral  zero,  but  cannot  track  the  spectral  poles  after  the 
transition  /m/-/i/  due  to  the  use  of  a constant  weighting  factor  ()v=l). 
In  Figure  4.17  (b),  the  VRLS  algorithm  (A=.91)  cannot  track  the 
spectral  zero  at  the  begining  of  the  process.  However,  Figures  4.18 
(a)  and  (b)  show  the  spectral  poles  and  zero  can  be  tracked  very 
accurately  using  the  VRLS-VFF  algorithm  using  different  input 
estimation  methods.  Figures  4.19  and  4.20  show  the  formant  and 
antiformant  tracks  of  the  VRLS-VFF  algorithm  using  different  input 
estimations,  respectively. 
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Table  4.3A  Comparison  of  different  recursive  algorithms  for  ARMA 

parameter  estimation  for  the  synthetic  speech  signal  SYN3. 


Algorithms 

SEARMA 

1 

WRLS 

Modified 

WRLS 

WRLS-VFF 

Forgetting 
Factor  (FF) 
X 

X=1.0 

X?=.  99 

\iin=  • 99 
variable  X^ 

Xq=0. 1 
Xmin=0. 99 
variable  X^ 

Input 

Estimation 

Method  1 

Method  1 

Method  1 

Method  3 

Formant 
Estimation 
Error  Ep  (Z) 

1.44 

1.20 

1.47 

.926 

Antiformant 
Estimation 
Error  Ez  (%) 

1.28 

8.0 

1.15 

.515 

RMS  Spectral 
Error  (DB) 

.607 

1.65 

.630 

.708 

Table  4.3B  Comparison  of  different  recursive  algorithms  for  ARMA 

prameter  estimation  for  the  synthetic  speech  signal  SYN4. 


Algorithms 

— 

SEARMA 

URLS 

WLSL 

WRLS-VFF 

Forgetting 
Factor  (FF) 
X 

X=1.0 

X=.  93 

X=.93 

Xq=0. 1 
Xmin=0. 93 
variable  \ 

Average  RMS 
Spectral 
Error  (dB) 

5.48 

2.37 

2.69 

2.10 

FREQUENCY  (KHZ) 
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Figure  4.12  (a)  Formant  tracks  of  SYN4. 


Figure  4.12  (b)  Log  magnitude  spectrum  (3D)  of  SYN4, 


FREQUENCY  (KHZ) 
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Figure  4.13  (a)  Formant  tracks  of  SYN4  using  the  SEARMA  method. 


Figure  4.13  (b)  Log  magnitude  spectrum  (3D)  of  SYN4  using 

the  SEARMA  method. 


FREQUENCY  (KHZ) 
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Figure  4.14  (a)  Formant  tracks  of  SYN4  using  the  WRLS  method. 


Figure  4.14  (b)  Log  magnitude  spectrum  (3D)  of  SYN4  using 

the  WRLS  method. 


FREQUENCY  (KHZ) 
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Figure  4.15  (a)  Formant  tracks  of  SYN4  using  the  WLSL  method. 


Figure  4.15  (b)  Log  magnitude  spectrum  (3D)  of  SYN4  using 

the  WLSL  method. 


FREQUENCY  (KHZ) 
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Figure  4.16  (a)  Formant  tracks  of  SYN4  using  the  WRLS-VFF  method. 


Figure  4.16  (b)  Log  magnitude  spectrum  (3D)  of  SYN4  using 

the  WRLS-VFF  method. 


AMPLITUDE  (DB) 
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Figure  4.17  (a)  Spectral  plots  of  SYN5  using  the  SEARMA  method. 


Figure  4.17  (b)  Spectral  plots  of  SYN5  using  the  VRLS  method. 


AMPLITUDE  (DB) 
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Figure  4.18  (a)Spectral  plots  of  SYN5  using  the  WRLS-VFF  method 
with  method  1 input  estimation. 


Figure  4.18  (b)  Spectral  plots  of  SYN5  using  the  VRLS-VFF  method 

with  method  2 input  estimation. 


FREQUENCY  (KHZ) 
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Figure  4.19  (a)  Formant  tracks  of  Figure  4.17(a). 
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Figure  4.19  (b)  Antiformant  tracks  of  Figure  4.17(a) 
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Figure  4.20  (a)  Formant  tracks  of  Figure  4.17  (b). 
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Figure  4.20  (b)  Antiformant  tracks  of  Figure  4.17(b) 
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Input  Cancellation 

The  parametric  estimation  approach  for  spectral  analysis  is  a 
method  that  estimates  a spectral  envelope  of  the  underlying  signal.  If 
signals  to  be  analyzed  consist  of  short  periodic  pulses,  the  spectrum 
cannot  be  estimated  accurately  due  to  the  effect  of  the  fundamental 
frequency  (F0)  and  its  harmonics  [Miyanaga  et  al,  1982].  For  example, 
in  speech  production,  if  the  fundamental  frequency  (F0)  of  the  input 
excitation  (pitch)  pulse  is  close  to  the  first  formant  (as  occurs  with 
some  female  and  children's  voices),  then  the  parametric  methods,  such 
as  linear  predictive  coding  (LPC),  cannot  adequately  estimate  the 
characteristics  (formants  and  bandwidths)  of  the  vocal  tract  filter 
[Huang  et  al.,  1987].  In  order  to  estimate  the  formants  and  bandwidths 
accurately,  it  is  necessary  to  eliminate  the  influence  of  the 
excitation  pulse  period  from  the  spectra  of  the  speech  signals.  The 
WRLS-VFF  (ARMA)  algorithm  estimates  the  input  from  the  measured  signal 
and  uses  this  input  as  well  as  the  measured  output  to  estimate  the 
unknown  parameters,  thus,  cancelling  the  influence  of  the  input 
excitation.  An  experimental  test  to  calculate  the  estimation  error  of 
the  first  formant  due  to  the  input  excitation  is  discussed  as  follows: 

A synthetic  signal  /i/  with  the  same  formant  and  bandwidth  as  in 
Table  4.1  but  with  different  fundamental  frequencies  was  used  to 
investigate  the  effect  of  the  excitation  (pitch)  pulse  on  the  formant 
estimation.  The  fundamental  frequency  (FQ)  was  varied  from  100  Hz  to 
200  Hz.  The  sampling  frequency  was  10  KHz,  the  analysis  data  window 
used  for  the  standard  LPC  method  was  256  points,  and  the  parameter 
update  point  for  URLS  and  VRLS-VFF  was  k=256.  The  results  in  Table  4.4 
show  that  for  the  LPC,  URLS  and  VRLS-VFF  (AR)  methods,  the  average 


81 


error  rate  of  FI  equaled  20. 7%,  but  the  error  rate  for  WRLS-VFF  (ARMA) 
was  negligible. 


Table  4.4  First  formant  estimation  using  different  algorithms  for 
the  synthetic  speech  signal  SYN6  (variable  fundamental 
frequency  F0). 


Fo 

(Hz) 

Reference 

Formant 

LPC 

IP=6 

URLS 

X-1.0 

WRLS-VFF 
)«=  .91  AR 

WRLS-VFF 
A^.91  ARMA 

100 

270 

275 

274 

274 

270 

120 

270 

259 

260 

260 

270 

140 

270 

276 

276 

275 

270 

160 

270 

285 

284 

283 

270 

180 

270 

265 

264 

265 

270 

200 

270 

244 

246 

245 

269 

Average  Error  Rate 

20.1% 

20.0% 

20.1% 

0.3% 
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Effect  of  Noise  on  Spectral  Estimation 
The  AR  spectral  estimates,  using  the  VRLS-VFF  algorithm 
discussed  in  the  previous  section,  have  been  shown  to  possess  excellent 
resolution  properties  for  the  analysis  of  noise-free  signals.  However, 
the  addition  of  white  noise  to  an  AR  process  results  in  an  inaccurate 
power  spectral  density  estimate  [Tierney,  1980].  The  reason  for  the 
degradation  of  the  spectral  estimate  in  the  presence  of  noise  is  that 
the  AR  assumption  that  the  process  can  be  represented  by  the  output  of 
an  all-pole  filter  excited  by  white  noise  is  no  longer  valid.  Thus, 
the  lower  the  signal-to-noise  ratio  (SNR),  the  more  the  "all-pole" 
assumption  is  violated,  and  the  poorer  the  resulting  the  spectral 
estimate  [Kay,  1979],  One  method  to  combat  the  effect  of  noise  on  the 
AR  spectral  estimate  is  to  increase  the  model  order.  This  is  because 
the  appropriate  model  for  an  AR  process  in  white  noise  is  an  ARMA  model 
and  an  ARMA  model  can  be  approximated  by  a high  order  AR  model  [Pagano, 
1974].  Another  method  for  the  spectral  estimation  of  noisy  signals  is 
to  use  an  ARMA  estimation  algorithm  directly.  In  this  section,  we 
estimate  the  AR  spectrum  of  a synthetic  speech  signal  corrupted  by 
white  noise  using  the  WRLS-VFF  algorithm  and  compare  the  results  with 
those  obtained  from  other  estimation  algorithms  such  as  LPC  and  SEARMA 
methods  applied  to  the  same  signals. 

Experimental  Results 

To  evaluate  quantitatively  the  accuracy  of  various  algorithms 
with  different  model  orders  applied  to  the  spectral  estimation  of  the 
noisy  signal,  we  use  the  formant  estimation  error  criterion  as  in 
(4.5).  The  additive  white  noise  was  generated  by  a Gaussian  random 
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number  generator  with  zero  mean  and  varying  variance,  depending  on  the 
required  SNR. 

Figures  4.21  (a),  (b),  and  (c)  show  the  synthetic  speech  signal 
SYN2  corrupted  by  noise  with  SNRs  = 0 dB,  10  dB,  and  20  dB, 
respectively. 

Table  4.5  summarizes  the  test  results  obtained  from  LPC,  SEARMA 
(Kalman),  and  VRLS-VFF  methods  for  the  spectral  estimation  of  noisy 
synthetic  speech  with  different  SNRs.  As  we  can  see,  the  VRLS-VFF 
method  performs  better  in  terms  of  the  accuracy  of  formant  estimation 
than  the  LPC  and  SEARMA  methods  at  high  SNR.  However,  at  low  SNR,  all 
methods  fail  to  estimate  the  AR  spectrum  accurately.  Figure  4.22  gives 

a picture  of  the  effect  of  noise  on  the  spectral  estimate  using  the 
VRLS-VFF  method. 

Next,  we  increase  the  order  of  the  AR  model  to  estimate  the 
spectrum  of  the  noisy  signal.  Figure  4.23  and  Table  4.6  indicate  that 
the  higher  the  model  order,  the  lower  the  error  of  the  estimated 
formants.  However,  if  the  model  order  is  much  higher  than  that  of  the 
true  process  (i.e.,  if  the  order  (IP)  of  the  synthetic  speech  signal  is 
six),  then  using  IP=14  for  the  analysis  gives  rise  to  some  unexpected 
formants  and  thus  degrades  the  accuracy  of  the  estimated  formants. 

ARMA  Model  Approximation  for  Noisy  Signal  Analysis 

To  use  an  ARMA  model  for  estimating  the  AR  parameters  of  noisy 
signals,  we  need  to  identify  the  different  noise  sources,  namely,  the 
additive  background  noise  such  as  room  noise  and  the  input  white  noise 
to  the  ARMA  process.  The  background  noise,  vj,  can  be  represented  by 
an  MA  process 
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vk  = vk  + nwk-l  + •••  + Ymvk_m  (4.11) 

and  is  always  present  regardless  of  the  presence  or  absence  of  the 
signal.  This  noise  process  is  locally  stationary  and  the  values  of 

Y^'s  are  constant  and  wk's  are  represented  by  a zero  mean  white 
Gaussian  noise  with  constant  variance.  The  observation  (measured) 
signal  corrupted  by  the  background  noise  can  be  expressed  as 

sk  = yk  + vk  (4.12) 

where  y^  is  an  output  of  an  ARMA  process  as  shown  in  (3.1),  namely, 

P q 

yk  = " 1 ai  yfc-i  + 1 bjuk-j  (4.13) 

i=l  j=o 

where  the  source  signal  u^  is  assumed  to  be  white  Gaussian  noise. 

From  (4.9)  and  (4.10),  we  have 


sk  = yk  + wk  + Ylwk-1  + •••  +Ymwk-m  (4.14) 

Substituting  (4.14)  into  (4.13)  and  rearranging  with  respect  to  the 
regression  of  sk,  we  have 

p q p+m 

^ ai  sk-i  = ^ bj  uk-j  + ^ &j  wk-j  (4.15) 

i=°  j=o  j=o 

where 

j 

Sj  = I a;  Yj  _i , aQ  = bQ  = y0  =1, 
i=o 

ai  = 0 (i  > p),  Yi  = 0 (k  > m). 

This  complete  model  of  the  observation  process  is  represented  by 
an  extended  ARMA  model  with  two  noise  sources  [Morikawa  and  Fujisaki, 
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1985].  In  this  formulation,  the  unique  estimation  of  the  AR  parameters 
of  the  underlying  system  is  difficult  since  two  sets  of  the  MA 
parameters  of  the  two  noise  processes  need  to  be  estimated 
simultaneously.  Here  we  consider  two  special  cases,  namely,  high  SNR 
and  low  SNR  conditions. 

When  the  SNR  is  high,  we  can  write 
q p+m 

E bj  uk_j  » E 5j  wk_j  (4.16) 
J=o  j=o 

Then,  we  obtain  the  same  formula  as  described  in  (3.29).  In  other 
words,  the  WRLS-VFF  method  can  be  applied  in  noise-free  as  well  as  in 
high  SNR  environments. 

Under  a low  SNR  condition,  we  can  assume  that 
q p+m 

E bj  uk-j  « z Sj  wk_j  (4.17) 
3=o  j =o 

Then,  the  AR  parameter  estimation  problem  involves  the  estimation  of 
the  MA  parameters  (wj)  from  a background  noise  process.  It  is 
necessary  to  develop  a method  to  estimate  these  w^s  before  executing 
the  adaptive  algorithm.  Simply  using  an  ARMA  model  and  increasing  the 
order  of  the  MA  process  does  not  guarantee  an  improvement  of  the 
spectral  estimates  of  the  noisy  signal.  This  can  be  seen  in  Figure 
4.24  and  Table  4.7. 

In  this  study,  we  only  consider  the  noise-free  condition.  For 
the  analysis  of  the  noisy  signal,  some  noise  cancellation  techniques 
[Widrow  et  al.,  1975;  Griffiths,  1978]  can  be  applied  before  using  the 
WRLS-VFF  algorithm. 
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(a)  SNR=0  DB 


Figure  4.21  Synthetic  speech  signal  SYN2  corrupted  by  noise. 
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Table  4.5  Formant  estimation  error  (EP)  of  different  algorithms  for 
the  synthetic  speech  signal  SYN2  /a/  with  different  signal 
to  noise  ratios  (SNRs). 


SNR 

0 DB 

5 DB 

10  DB 

15  DB 

20  DB 

EP  (X) 
LPC  IP=10 

22.5 

16.8 

7.98 

4.42 

3.70 

EP  (X) 
SEARMA 

21.7 

16.0 

7.45 

4.56 

3.81 

EP  (X) 
WRLS-VFF 

21.6 

12.1 

6.07 

3.33 

2.98 

Table  4.6  Forn 
synt 

nant  estimation  error  (Ep)  using  WRL-VFF  algc 
:hetic  speech  SYN2  with  SNR=20  DB. 

)rithm  for 

AR 

model  order 

AR(6) 

AR(8) 

AR(10) 

AR( 12 ) 

AR(14) 

EP  (X) 
WRLS-VFF 

120.0 

16.74 

2.98 

1.32 

1.687 

able  4.7  Forms 
synt 

int  estimation  error  (Ep)  using  WRLS-VFF  algc 
hetic  speech  SYN2  with  SNR=20  DB  and  variabl 

)rithm  for 
e MA  order. 

Model  order 

AR(10) 

^RMA(10,2)  1 

^RMA( 10,4) 

^RMA( 10 , 6) 

ARMA(10,8) 

EP  (X) 
WRLS-VFF 

2.98 

1.97 

2.51 

1.22 

2.67 

88 


Figure  4.22  Estimated  spectra  using  the  VRLS-VFF  algorithm 

of  the  synthetic  speech  signal  SYN2  with  different 
SNRs : (a)  SNR  = 0 DB,  (b)  SNR  = 5 DB,  (c)  SNR  = 10 
DB,  (d)  SNR  = 15  DB,  (e)  SNR  = 20  DB. 
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Figure  4.23  Estimated  spectra  using  the  VRLS-VFF  algorithm  of 
the  synthetic  speech  signal  SYN2  with  different 
orders  of  an  AR  process: 

(a)  original  spectrum  for  AR(6), 

(b)  estimated  spectrum  for  SNR  = 20  DB  and  AR(6), 

(c)  estimated  spectrum  for  SNR  = 20  DB  and  AR(8), 

(d)  estimated  spectrum  for  SNR  = 20  DB  and  AR(10), 

(e)  estimated  spectrum  for  SNR  = 20  DB  and  AR(12). 
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(e) 


(d) 


(c) 


(b) 


(a) 


Figure  4.24  Estimated  spectra  using  the  WRLS-VFF  algorithm  of 
the  synthetic  speech  signal  SYN2  with  SNR  = 20  DB 
and  different  orders  of  an  MA  process: 

(a)  AR(10),  (b)  ARMA(10,2),  (c)  ARMA(10,4), 

(d)  ARMA( 10 , 6) , (e)  ARMA(10,8). 
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Complexity  Analysis  of  the  VRLS-VFF  Algorithm 
The  VRLS-VFF  algorithm  presented  in  (3.29)  requires  on  the  order 
of  5n2  + 6n  multiplications  and  additions  (flops)  per  data  point  (see 
Appendix  D for  a detailed  complexity  analysis)  where  n=p+q,  p and  q are 
the  model  orders  of  the  AR  and  MA  process,  respectively.  Comparing  the 
flops  required  in  the  VRLS-VFF  algorithm  with  those  of  the  block  data 
processing  techniques  used  for  AR  parameter  estimation,  the  amount  of 
computation  in  the  VRLS-VFF  algorithm  may  become  excessively  large  with 
increasing  analysis  samples  (N).  Table  4.8  shows  that  the 
computational  complexity  for  estimating  AR  parameters  using  N samples 
requires  only  p2  + pN  flops  for  the  autocorrelation  method  (Durbin's 
algorithm)  and  (5/2)p2  + Np  flops  for  the  lattice  method  (Burg's 
algorithm).  In  comparison,  using  the  VRLS-VFF  algorithm  requires  5Np2 
+ 6Np  flops.  Furthermore,  some  fast  adaptive  least  squares  algorithms 
can  reduce  the  computational  complexity  to  O(Np)  for  analyzing  N data 
points  (Table  4.9). 

For  ARMA  parameter  estimation,  conventional  block  data  processing 
techniques,  such  as  the  least  squares  modified  Yule-Valker  equation 
(LSMYVE)  method,  require  on  the  order  of  (l/6)p3  + Np2/2  + Np  flops 
(Appendix  E).  The  VRLS-VFF  algorithm,  however,  requires  on  the  order 

of  5N(p+q)2  + N( p+q ) flops.  Therefore,  the  main  disadvantage  of  the 
VRLS-VFF  algorithm  is  its  computational  complexity. 

The  main  factor  contributing  to  the  cost  of  the  complexity  in  the 
VRLS-VFF  algorithm  is  the  computation  of  the  gain  vector  Pk+k,  which 
requires  a number  of  flops  proportional  to  (p+q)2  per  data  point. 
Thus,  it  is  important  to  search  for  more  efficient  algorithms  for 
matrix  multiplication.  Using  the  idea  of  shift  low  rank  developed  by 
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Morf  and  Lee  [1979]  has  led  to  implementations  of  the  WRLS-VFF 

algorithm  requiring  a number  of  flops  proportional  to  0(N(p+q))  rather 
than  to  0(N(p+q)2). 


Discussion 

In  this  chapter  we  have  tested  several  parameter  estimation 
algorithms  for  speech-like  time-varying  signals.  First,  we  compared 
the  performance  of  the  estimation  accuracy  and  spectral  matching 
ability  between  the  block  processing  approach  and  the  proposed 
recursive  method.  Then  several  recursive  algorithms  were  also  tested 
for  nonstationary  signals.  All  the  experimental  results  and  figures 
show  that  the  performance  of  the  WRLS-VFF  algorithm  in  tracking  the 
spectral  variations  and  accuracely  estimating  signal  parameters  is 
superior  to  both  the  block  processing  techniques  and  the  other 
recursive  algorithms.  This  is  because  the  WRLS-VFF  algorithm  uses  a 
variable  forgetting  factor,  which  can  be  obtained  recursively  during 
the  adaptive  process,  to  control  the  adaptive  gain,  determine  the 
effective  memory  length,  estimate  the  input  excitation,  and  eliminate 
its  influence.  The  only  disadvantage  of  this  algorithm  over  the  block 
processing  methods  is  the  computational  load,  since  the  adaptive 
process  is  accomplished  sequentially  on  a sample-by-sample  basis. 
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Table  4.8  Computational  cost  for  AR  parameter  estimation  using 
block  (batch)  data  processing  techniques  with  order  p 
and  N analysis  samples. 


Algori thm 

Multiplications 
and  additions 

References 

Autocorrelation  method 
(Durbin's  algorithm) 

p2  + pN 

[Makhoul,  1977] 

Covariance  method 
(Choleskey  decomposition) 

(l/6)p3  + p2N/2 

[Graupe  et  al. , 1980; 
Golub,  1983] 

Lattice  method 
(Burg's  algorithm) 

(5/2)p2  + pN 

[Makhoul,  1977; 
Graupe  et  al,  1980] 

Table  4.9  Computational  cost  for  AR  parameter  estimation  using 
recursive  techniques  with  order  p for  each  sample. 


Algorithms 

Multiplications 
and  additions 

References 

Weighted  recursive 
least  square  (WRLS)  method 

4p2  + 5p 

[Friedlander , 1982b] 

WRLS-VFF  method 

5p2  + 6p 

[see  Appendix  4A] 

1 

Fast  adaptive  forward  and 
backward  LS  method 

9p 



[Kalouptisdis  and 
Theodorids,  1987] 

Unnormalized 
lattice  method 

13p 

[Morf  and  Lee, 1979] 

Normalized  fixed  order 
method 

9p 

with  2 sq.  roots 

[Coiffi  and  Kailath, 
1984] 

CHAPTER  5 


APPLICATIONS  OF  THE  WRLS-VFF  ALGORITHM  FOR  SPEECH  SIGNAL  ANALYSIS 

Introduction 

In  this  chapter  we  present  three  applications  using  the  WRLS-VFF 
algorithm  for  speech  signal  analysis.  For  a nonstationary  speech 
signal,  accurately  tracking  speech  parameters  like  vocal  tract 
resonance  frequencies  (formants)  and  their  bandwidths  is  essential  for 
the  development  of  speech  recognition  and  speech  synthesis  systems. 
Using  the  frame-based  linear  predictive  coding  (LPC)  analysis,  the 
accuracy  of  formant  tracking  of  a speech  signal  is  affected  by  1)  the 
position  of  the  analysis  frame,  2)  the  length  of  the  analysis  window, 
and  3)  the  time-varying  characteristics  of  the  speech  signal.  An 
adaptive  filter  approach,  that  tracks  the  time-varying  parameters  of 
the  vocal  tract  and  updates  the  parameters  during  the  glottal  closed 
phase  interval,  can  reduce  the  formant  estimation  error.  In  the  first 
section  of  this  chapter,  we  introduce  a closed  phase  WRLS-VFF  algorithm 
for  formant  tracking  and  compare  its  performance  with  other  methods 
such  as  closed  phase  covariance  analysis  and  the  weighted  least  squares 
lattice  method.  In  the  second  section,  we  discuss  the  estimation  of  a 
glottal  excitation  source  using  the  glottal  inverse  filtering  (GIF) 
analysis  method.  The  WRLS-VFF  algorithm,  which  sequentially  estimates 
the  filter  coefficients,  estimation  error,  the  variable  forgetting 
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factor,  the  pitch  period,  and  the  instant  at  which  the  glottis  closes 
can  be  easily  applied  to  GIF  analysis.  An  implementation  of  the  GIF 
method  based  on  the  WRLS-VFF  algorithm  is  presented.  The  performance 
of  this  method  is  compared  with  other  methods  (i.e.,  the  two-pass  and 
two-channel  methods). 

Finally,  we  derive  an  ARMA  cepstral  distance  measure  based  on  the 
WRLS-VFF  algorithm.  The  ARMA  cepstral  coefficients  can  be  obtained 
from  a set  of  recursive  equations,  which  do  not  require  the  Fast 
Fourier  Transform  (FFT).  Some  potential  applications  of  the  ARMA 
cepstral  coefficients  are  discussed. 

Estimation  the  Speech  Parameters  Based  on  the  WRLS-VFF  Algorithm 
Modeling  of  the  Human  Speech  Production  System 

The  model  of  the  acoustic  production  of  the  speech  signal,  based 
on  the  "source-filter"  concept  [Fant,  1960]  is  shown  in  Figure  5.1. 
This  speech  production  model  (SPM)  is  composed  of  three  acoustic 
filters,  namely,  glottal  shaping  filter,  vocal  tract  filter,  and  lip 
radiation  filter.  The  input  excitation  un  to  the  SPM  is  a pulse  train 
with  period  P for  voiced  sounds  and  random  noise  having  a flat  spectrum 
for  unvoiced  sounds. 

Glottal  shaping  filter.  This  filter  can  be  represented  by  a two- 
pole  low-pass  filter  with  an  estimated  cutoff  at  about  100Hz  [Markel 
and  Gray,  1976].  The  z-transform  of  the  glottal  shaping  filter  G(z)  is 
given  by 

1 

G(z)  = 


(1  - e 


(5.1) 


96 


excitation 


glottal 

volume 

velocity 


lip  speech 

volume 

velocity 


Figure  5.1  Speech  production  model. 


Gain  (G) 


Figure  5.2  Linear  prediction  model. 
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where 

plane. 


e_cT  is  the  location  of  a pole  inside  the 
The  output  of  this  filter  is  the  glottal 


unit  circle  in  the  z- 
volume  velocity,  gn. 


Vocal  tract  model.  The  vocal  tract  model,  based  on  an  acoustic 
tube  approximation  [Rabiner  and  Shafer,  1975],  is  assumed  to  be  an  all- 
pole model  consisting  of  a cascade  of  a small  number  of  two-pole 
resonators.  Each  resonance  is  defined  as  a formant  with  a 
corresponding  center  frequency  and  bandwidth.  The  z-transform  of  this 
model  V(z) , consisting  of  K formants,  is  described  by 


K 1 

v(z)  = n — .c  2) 

i=l  1-2  e-CiT  cos(biT)  z'1  + e-<4T  z-2  V 

where  the  ith  formant  frequency  and  bandwidth  are  computed  from 
Fi=bi/2rc  and  B^cj/n,  respectively.  The  output  of  this  filter,  with 
input  gn,  is  the  lip  (oral)  volume  velocity,  vn. 


Lip  radiation  filter.  The  speech  pressure  wave  is  related  to  the 
oral  volume  velocity  at  the  lip  through  a radiation  impedance  R(z). 
For  frequencies  below  about  4000  Hz,  R(z)  can  be  represented  by  a high 
pass  filter  [Flanagan,  1972] 


Overall  model.  The  overall  linear  model  of  speech  production  for 
an  acoustic  sound  pressure  transform  S(z)  is  the  product  of  the  z- 
transform  of  the  lip  radiation  filter,  the  vocal  tract  filter,  the 
glottal  shaping  filter,  and  the  excitation,  and  is  given  by 
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S(z)  = U(z)  G(z)  V(z)  R(z)  (5.4) 

The  offset  of  the  three  acoustic  filters  can  be  grouped  together 
and  represented  by  a single  filter,  which  is  termed  "vocal  tract 
system"  of  the  SPM.  The  transfer  function  of  the  vocal  tract  system, 
Hs(z),  is  given  by 

Hs(z)  = S(z)/U(z) 

(1  - z-1)  G 
K 

(1  - e-cTz-1)2  n [1  - 2e~ciTcos(biT)z“1  + e~2ciTz-2] 
i = l 

(5.5) 

where  K formants  are  defined  in  the  model  and  G is  the  total  effective 
filter  gain  that  matches  the  energy  or  average  value  of  the  input 
spectrum  to  the  model  spectrum. 

There  is  only  one  numerator  term  (l-z_l)  in  (5.5)  and  it  is 
nearly  cancelled  by  one  of  the  denominator  terms  [ l-exp(-cT)z_]  since 
cT  is  generally  much  less  than  unity  [Markel  and  Gray,  1976].  A 
further  reduction  of  the  discrete  model  in  (5.5)  can  give  a simplified 
transform  function  of  the  SPM,  namely, 

H(z)  = (5.6) 

P 

1 + E aj  z~1 
i=l 

where  is  the  filter  coefficient  and  p=2*K+l,  is  the  filter  order. 

In  speech  analysis,  we  are  interested  in  techniques  for 
estimating  the  parameters  of  the  SPM  from  the  speech  signal.  In  speech 
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synthesis,  we  reproduce  the  acoustic  speech  signal  by  controlling  the 
various  parameters  of  the  SPM.  Among  these  techniques,  the  linear 
prediction  coding  (LPC)  method  has  been  widely  used  for  estimating  the 
vocal  tract  characteristics  of  speech.  We  will  discuss  the  basis  of 
the  LPC  method  and  some  drawbacks  of  this  model  in  the  following 
sections . 


LPC  Analysis  of  the  Speech  Signal 

Linear  prediction  model  for  speech  production.  Figure  5.2 
presents  the  all-pole  model  of  linear  prediction  (LP).  We  know  that 


U(z)  G 

S(z)  = (5.7) 

P 

1 + E aj  z 1 
i=l 


which  in  the  time  domain,  becomes 
P 

sn  = - E a4  sn_i  + G un  (5.8) 

i = l 


Let  us  assume  that  given  a time  series  of  speech  samples,  the  current 
sample  can  be  estimated  as  a linear  combination  of  p past  samples. 
Mathematically, 


sn  = - r ai  sn_i  (5.9) 

i=l 


The  error  between  the  value  of  the  actual  sample  and  its  estimate  is 

P 

en  = sn  “ §n  = sn  + 2 *i  sn-i 

i=l 


(5.10) 
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and  equivalently, 

P 

sn  = _ 2 ai  sn-i  + en  (5.11) 

If  the  linear  prediction  model  of  (5.11)  conforms  to  the  SPM  given  by 
(5.6),  then  en=Gun  and  a^=aj.  Thus  the  coefficients  aj,  i=l,..,p 

identify  the  system,  whose  output  is  sn.  The  problem  is  to  determine 
the  values  of  the  coefficients  aj  from  the  speech  signal. 

Solutions to  the  LP  model.  The  criterion  used  to  obtain  the 

coefficients  (a^)  is  the  minimization  of  the  squared  prediction  error  E 
with  respect  to  each  coefficient  aj,  over  some  time  interval,  where 

E . I en2 
n 

This  leads  to  the  following  set  of  normal  equations: 

P 

. 2 di  2 sn-i  sn_k  = - E sn  sn_k  1 < k < p (5.12) 

i=l  n n 

For  a short-time  analysis,  the  limit  of  summation  is  finite.  The 
particular  choice  of  these  limits  has  led  to  two  methods  of  analysis, 
namely,  the  autocorrelation  method  (AUTO)  and  the  covariance  method 
[Markel  and  Gray,  1976]. 

The  autocorrelation  method  results  in  a filter  structure  that  is 
guaranteed  to  be  stable.  At  the  same  time,  it  operates  on  a data 
segment  that  is  windowed  using  a Hanning  or  a Hamming  window,  typically 
20-30  ms  long  (two  to  three  pitch  periods). 
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The  covariance  method,  on  the  other  hand,  gives  a filter  that  is 
not  guaranteed  to  be  stable,  but  requires  no  explicit  windowing. 

Problems  of  the  LP  analysis  of  speech.  In  the  AUTO  method,  the 
autocorrelation  function  of  a speech  signal  contains  information  about 
the  vocal  tract  (VT)  transfer  function  as  well  as  the  excitation 
source,  which  is  a series  of  pulses  or  white  noise.  Hence,  the 

accuracy  of  the  extracted  parameters  based  on  the  autocorrelation 
function  is  affected  by  the  characteristics  of  the  excitation  source, 
especially  for  voiced  speech  with  a short  interval  [Moirkawa  and 
Fujisaki,  1982;  Miyoshi  et  al.,  1987].  Furthermore,  we  often  assume 
that  the  speech  signal  is  stationary  during  the  analysis  window  for  the 
AUTO  method,  but  this  is  only  true  for  vowels.  However,  in  many  cases 
such  as  diphthongs  /ai/,  /ei/,  sliding  sounds  /we/,  and  transitions 

between  vowels  and  consonants,  the  vocal  tract  parameters  change  with 
time.  The  AUTO  method  gives  an  "average"  estimate  of  the  VT 

parameters,  but  cannot  track  the  variations  of  these  parameters. 
However,  tracking  the  variations  of  VT  parameters  is  important  for  many 
speech  synthesis  applications. 

The  problems  of  the  AUTO  method  can  be  reduced  using  the 
covariance  method  when  the  analysis  window  is  restricted  to  a maximum 
of  one  pitch  period  in  length.  The  effectiveness  of  the  covariance 
method  relies  on  accurately  identifying  the  locations  of  the  excitation 
pulses.  If  the  analysis  interval  (even  if  one  pitch  period  long) 
straddles  two  pitch  periods,  then  the  covariance  method  will  not  yield 
a reasonable  estimate  of  the  VT  parameters  [ Parthasarathy  and  Tufts, 
1987].  One  practical  approach  for  the  covariance  method  is  to  locate 
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the  pitch  period  first,  and  then  analyze  the  data  in  the  interval 
synchronized  with  the  pitch  period.  This  is  called  the  "pitch- 
synchronous  covariance  method." 

Childers  and  Larar  [1984]  indicated  that  even  if  the  analysis 
interval  is  restricted  to  one  period  in  the  pitch  synchronous 
covariance  method,  there  will  still  be  a parameter  estimation  error  due 
to  source-tract  interaction.  This  is  because  the  speech  waveforms  in 
one  pitch  period  encompass  several  glottal  excitation  events  (e.g.,  the 
glottal  closed  and  open  phase  which  cannot  be  represented  as  a single 
excitation  segment). 

Furthermore,  the  conventional  LP  method  is  based  on  the 
assumption  that  the  vocal  tract  characteristics  can  be  represented  by 
only  a small,  finite  number  of  poles.  If  zeros  exist  in  the  transfer 
function,  as  is  the  case  with  nasalized  sounds,  then  the  error  in 
estimating  the  vocal  tract  transfer  function  using  the  LP  method  will 
increase. 

Based  on  the  above  discussion,  the  problems  of  the  LP  analysis  of 
speech  can  be  summarized  as 

(1)  Estimation  errors  due  to  the  excitation  source. 

(2)  Estimation  errors  due  to  the  source-tract  interaction. 

(3)  Estimation  errors  due  to  the  time-varying  vocal  tract 

characteristics . 

(4)  Estimation  errors  due  to  the  all— pole  model  when  the  analyzed 
signal  is  generated  from  a pole-zero  model. 

Solutions  to  the  Problems  of  LP  Analysis.  Problem  (1)  and  (2) 
can  be  alleviated  if  the  analysis  interval  is  restricted  to  encompass 
only  the  closed  glottal  region  [Berouti,  1976;  Wong  et  al.,  1979; 
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Veenman  and  BeMent,  1985;  Krishnamurty  and  Childers,  1986].  This  is 
referred  to  as  "closed  phase  covariance  method."  Problem  (3)  can  be 
overcome  if  recursive  algorithms  are  used  to  track  the  time-varying  VT 
parameters.  Program  (4)  can  only  be  solved  if  we  use  an  ARMA  model. 

One  approach  to  solve  these  problems  simultaneously  is  to  use  a 
closed  phase  WRLS-VFF  AR  (ARMA)  algorithm,  which  will  be  discussed  in 
the  next  section. 

Pi tch— Synchronous  Closed  Phase  Analysis  of  the  Speech  Signal 

Glottal  closed  phase  detection.  Past  use  of  a pitch-synchronous 
closed  phase  analysis  method  has  been  limited  due  to  difficulties 
associated  with  the  task  of  accurately  isolating  the  closed  phase 
region  in  successive  periods  of  speech.  Krishnamurty  and  Childers 

[1985]  introduced  a method  using  an  electroglottographic  (EGG)  signal 
to  determine  the  glottal  closed  phase.  Simply  speaking,  the  EGG 

monitors  variations  in  vocal  fold  contact  by  measuring  tissue  impedance 
[Appendix  F ] ; glottal  closure  is  associated  with  a reduction  in  tissue 
impedance.  A time  lag  of  approximately  0.7  ms  for  the  acoustic 

propagation  delay  from  the  glottis  to  the  microphone  is  applied  to  the 
EGG  signal  when  it  is  compared  with  the  speech  signal.  In  Veenman's 
method,  intervals  of  closed  phase  are  determined  by  thresholding  the 
EGG  signal  at  50  percent  of  the  peak  amplitude.  Childers  and 
Krishnamurty  provide  a more  accurate  method  that  uses  the  EGG  and 
differentiated  EGG  (DEGG)  signal  to  determine  the  glottal  closed  phase 
region  automatically. 
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Closed  Phase  Covariance  (CPC)  method.  The  basic  principle  of  the 
CPC  method  is  to  analyze  speech  signals  only  over  the  closed  glottal 
interval.  During  the  interval  of  glottal  closure,  the  glottal  volume 
velocity,  gn,  is  zero,  and  so  is  the  excitation,  un.  Thus,  (5.8) 
becomes 

P 

sn  = E ai  sn-i  (5.13) 

That  is,  one  sample  after  the  glottal  closure  instant,  the  speech 
waveform  is  strictly  a function  of  the  vocal  tract  resonances  specified 
by  alf...,ap  and  the  initial  conditions  sn, . . . , sn_p+1 . This  result 
holds  over  the  entire  closed  glottal  interval.  The  filter  coefficients 
ai  can  be  estimated  by  the  covariance  method  of  LP  analysis. 

A comparison  of  different  LP  methods  for  speech  analysis  has  been 
studied  in  [Krishnamurty  and  Childers,  1986].  The  results  showed  the 
superiority  of  the  CP  covariance  method  in  terms  of  estimation  accuracy 
and  tracking  ability  over  several  other  common  LP  methods,  such  as  LPC 
asynchronous,  pitch-synchronous  covariance,  and  pitch-synchronous 
circulation  methods. 

Closed  Phase  Weighted  Least  Squares  Lattice  (WLSL)  method.  The 
closed  phase  WLSL  algorithm  is  attractive  for  speech  analysis  for 
several  reasons: 

(1)  The  lattice  structure  and  acoustic  tube  reflection 
coefficients  (RC)  used  to  model  the  vocal  tract  are  related. 

(2)  The  quantization  properties  of  RC  are  advantageous  in  speech 
analysis  and  the  arithmetic  properties  of  finite-word  length  lattice 
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filters  are  beneficial. 

(3)  The  RC  are  known  to  vary  slowly  across  the  boundaries  of 
various  speech  sounds,  thus  creating  manageable  coefficient  update 
rates. 

The  block  diagram  for  speech  analysis  based  on  the  closed  phase 
VLSL  method  is  shown  in  Figure.  5.3  [Ting  et  al.,  1988].  The  recursive 
equations  in  the  algorithm  can  be  found  in  [Lee  et  al.,  1981].  The 
likelihood  variable  yp>T  , where  p is  the  filter  order  and  T is  the 
time  index,  can  be  computed  recursively  during  the  LSL  recursion.  It 
has  been  shown  [Cowan  and  Grant,  1985]  that  the  changes  in  Yp  T 
indicate  a change  in  the  signal,  e.g.,  a change  of  the  state  of  the 
glottis  (open  or  closed)  in  the  speech  production  process.  Experiments 
[Cowan  and  Grant,  1985]  suggest  that  the  maximal  slope  of  the 
likelihood  variable  is  related  to  the  glottal  closing  instant.  This 
fact  may  be  combined  with  a threshold  on  the  likelihood  variable  to 
estimate  the  closed  glottal  interval  from  the  speech  signal  only.  The 
threshold  level  can  be  estimated  by  using  off-line  EGG  signals.  The 

filter  coefficients  can  be  derived  from  the  reflection  coefficients  at 
the  optimal  position. 

Closed  Phase  URLS-VFF  method.  The  closed  phase  WRLS-VFF  method 
is  based  on  the  algorithm  developed  in  chapter  three.  In  this 
algorithm,  the  estimation  error  is  usually  large  when  there  is  a large 
glottal  open  phase  in  the  speech  signal.  In  such  cases,  a small  is 
used  to  decrease  the  contributions  to  the  error  in  the  estimation 
process.  Therefore,  small  values  of  A^  are  related  to  the  instant  at 
which  the  glottis  closes,  where  the  prediction  error  is  maximum,  temp- 
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Speech 


Figure  5.3.  Closed  phase  VLSL  method. 
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Figure  5.4  Speech  signal  SMI  (upper),  the  corresponding  DEGG 

signal  (middle),  and  the  variable  forgetting  factor  V 
(lower).  x 


108 


Figure  5.4  shows  that  the  regions  of  small  correspond  to  the 
negative  peaks  of  the  DEGG  signals  which  are  reliable  estimates  of  the 
closing  instant  of  the  glottis  [Childers  and  Krishnamur thy , 1985].  The 
closed  phase  WRLS-VFF  algorithm  for  speech  analysis  extracts  the  vocal 
tract  parameters  only  from  the  glottal  closed  interval.  The  selection 
of  the  AR  or  ARMA  model  for  the  WRLS-VFF  algorithm  depends  on  the 
signal  to  be  analyzed.  This  algorithm  can  be  implemented  as  follows: 

(1)  Initialize  the  values  of  P0,  90,  A,,,.^,  and  Z IQ,  and  specify 
the  filter  order.  (Experience  shows  that  the  values  of  Po  and  Z IQ  are 
insensitive  to  the  algorithm  as  long  as  they  are  large  enough  (e.g., 
Po=102,  Z Io=106)). 

(2)  Compute  the  filter  gain  K^,  error  covariance  P^  and 
prediction  error  v^  using  (3.28). 

(3)  Compute  \ using  (3.17),  if  then  \=\,in. 

(4)  Predict  the  new  filter  coefficient  vector  9^. 

(5)  Check  for  the  glottal  closed  phase  using  the  EGG  signal  or 
^k>  if  there  is  a closed  glottal  interval,  then  extract  the  formants 
and  bandwidths  of  the  vocal  tract  by  solving  for  the  roots  of  the 
polynomial  obtained  from  9^. 

(6)  Go  to  (2)  until  end  of  data. 

Data  Collection 

The  EGG  and  acoustic  speech  waveforms  were  simultaneously 
digitized  and  stored  on  a computer  disk.  These  two  signals  were 

recorded  with  the  speakers  situated  inside  an  Industrial  Acoustics 
Company  single-wall  sound  room.  The  speech  signal  was  obtained  with  an 
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Electro-Voice  RE-10  dynamic  cardoid  microphone  and  a Bruel  & Kjaer  (B  & 
K)  model  4133  condenser  microphone.  The  EGG  signal  was  obtained  with  a 
Synchrovoice  electroglot tograph . The  microphone  was  held  at  a fixed 
distance  (6  inch)  from  the  speaker's  lips.  The  signal  digitization  was 
accomplished  by  a Digital  Sound  Corporation  (DSC)  model  200  stereo  A/D 
and  D/A  system.  The  DSC-200  system  has  16-bit  accuracy.  The  signals 
were  digitized  with  a sampling  frequency  of  10  kHz,  with  a 5 kHz 
antialiasing  filter  being  used  before  digitization. 

The  following  isolated  words  and  sentences  spoken  by  a male 
subject  (DMH)  and  a female  subject  (DDH)  were  used  for  analysis  in  this 
study. 

Isolated  words: 

- "five"  for  diphthong  /ai/  (Figure  5.5). 

- "seven"  for  fricative  /s/  and  nasal  /n/  (Figure  5.6). 

Sentences : 

- Ve  were  away  a year  ago.  (vowels) 

- Early  one  morning  a man  and  a woman  ambled  along  a one  mile 
lane,  (nasalized  vowels) 

- Should  we  chase  those  cowboys?  (fricatives) 

Experimental  Results 

Test  conditions.  The  data  from  the  two  isolated  words  and  three 
sentences  were  analyzed  by  the  above  three  closed  phase  methods.  Due 
to  the  nonstationarity  of  speech  signals,  the  performance  of  each 
algorithm  was  evaluated  on  the  tracking  ability  of  formants  and 
bandwidths.  For  formant/bandwidth  extraction,  the  root  of  the 
polynomial  A(z)  for  the  AR  model  (4.7)  or  A(Z)  and  B(z)  for  ARMA  model 
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Figure  5.5  (a)  Speech  waveform  for  the  utterance  "five" 
spoken  by  DMH . 


Figure  5.5  (b)  EGG  waveform  corresponding  to  Figure  5.5  (a). 
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Figure  5.6  (a)  The  speech  waveform  for  the  utterance  "seven", 
spoken  by  DMH. 


Figure  5.6  (b)  EGG  waveform  corresponding  to  Figure  5.6  (a). 
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(4.8)  were  determined  for  each  glottal  closed  phase  for  each  period. 
All  roots  with  a Q factor  (center  frequency  divided  by  bandwidth)  less 
than  one  were  eliminated.  The  remaining  roots  were  retained  as 
potential  formant  roots,  and  constituted  the  raw  formant  data.  No 
other  form  of  formant  trajectory  smoothing  or  filtering  was  done.  The 
analysis  conditions  for  each  method  are  indicated  below. 

- A 12th  order  LP  coefficient  was  used  for  the  AR  model,  and 
IP=12,  IQ=6  for  the  ARMA  model.  (The  method  to  determine  the  order  of 
the  AR  model  is  discussed  at  the  end  of  this  section.) 

- No  preemphasis  was  used,  since  the  closed  phase  vocal  tract 
filters  derived  with  and  without  preemphasis  were  virtually  the  same 
[Krishnamurty , 1983]. 

Test  results  for  analyzing  male  voice.  Figures  5.7,  5.8,  and  5.9 
show  the  formant  tracking  contours  and  log  magnitude  spectral  plots  for 
the  utterance  "five,"  estimated  by  CPC,  closed  phase  WRLSL,  and  closed 
phase  WRLS-VFF  (AR)  methods,  respectively. 

Figures  5.10,  5.11,  and  5.12  show  the  formant  tracking  contours 
and  log  magnitude  spectral  plots  for  the  utterance  "seven,"  estimated 

by  CPC,  closed  phase  WRLSL,  and  closed  phase  WRLS-VFF  (AR)  methods, 
respectively. 

Figure  5.13  shows  the  formant  tracking  contours  and  log  magnitude 
spectral  plots  for  the  utterance  "seven,"  estimated  by  closed  phase 
WRLS-VFF  (ARMA)  method. 

An  examination  of  the  spectral  plots  of  these  figures  shows  that 
the  WRLS-VFF  method  is  better  than  either  the  WLSL  or  CPC  in  terms  of 
tracking  ability  and  smoothness  of  the  tracks  of  the  bandwidth 
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estimates.  The  WLSL  method  is  unable  to  track  the  formants  accurately, 
especially  during  the  vowel  transition  region  /a/-/i/  and  almost  looses 
the  track  after  the  transition,  as  shown  in  Figure  5.8.  The  CPC  method 
introduces  abrupt  changes  in  the  formant  bandwidths. 

For  fricative  and  nasalized  sounds  analyses,  as  in  the  case  of 
seven,  the  WRLS-VFF  (ARMA)  method  tracks  the  spectral  zero  variations 
(antiformants  and  antibandwidths)  as  shown  in  Figure  5.13. 

Figures  5.14  and  5.15  are  the  formant  tracking  contours  for  the 
utterance  "we  were  away"  for  the  WRLS-VFF  (AR)  and  CPC  method, 
respectively. 

The  formant  tracking  contours  estimated  by  the  WRLS-VFF  (AR)  and 
CPC  methods  for  the  utterance  "a  one  mile  lane"  are  shown  in  Figures 
5.16  and  5.17,  respectively. 

The  formant  and  antiformant  tracking  contours  estimated  by  the 
WRLS-VFF  (ARMA)  method  for  the  utterance  "a  one  mile  lane"  are  shown  in 
Figure  5.18  and  5.19,  respectively. 

The  formant  tracking  contours  estimated  by  the  WRLS-VFF  (AR)  and 
CPC  methods  for  the  utterance  "should  we  chase"  are  shown  in  Figures 
5.20  and  21,  respectively. 

An  examination  of  these  figures  shows  that  the  WRLS-VFF  method 
gives  the  best  results  for  the  formant  trajectories,  in  the  sense  that 
there  are  very  few  sudden  jumps  in  the  contours.  For  the  CPC  method, 
when  there  are  rapid  changes  of  the  vocal  tract  parameters  as  in  the 
case  of  the  glide  sound  /w/-/e/  (Figure  5.15),  the  ability  to  track  the 
formant  changes  is  not  as  good  as  the  WRLS-VFF  method.  For  analyzing 
the  nasalized  utterance  "a  one  mile  lane,"  the  WRLS-VFF  (ARMA)  method 
provides  smoother  formant  tracks,  as  shown  in  Figure  5.18. 
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Furthermore,  the  antiformant  tracks  can  also  be  tracked  well,  as  shown 
in  Figure  5.19. 

T.est results  for  analyzing  female  voices.  The  female  voice 

usually  has  a higher  fundamental  frequency  than  a male  voice.  This 
results  in  a short  pitch  period  during  the  analysis.  Experimental 
results  show  that  the  CPC  method  fails  to  analyze  female  speech.  Test 
results  were  obtained  only  from  the  LSL  and  WRLS-VFF  methods. 

Figure  5.22  shows  the  formant  tracking  contours  and  log  magnitude 
plots  for  the  utterance  "five,"  estimated  by  the  closed  phase  LSL 
methods . 

Figures  5.23  and  5.24  show  the  format  tracking  contours  and  log 
magnitude  spectral  plots,  respectively,  for  the  utterance  "five"  and 
"we  were  away,"  estimated  by  the  WRLS-VFF  algorithm. 

Comparing  these  results  with  those  obtained  from  the  analysis  of 
the  male  speech  signal,  we  found  that  the  formant  tracking  contours  are 
not  as  smooth  as  those  of  the  male's.  This  is  because  in  the  male 
voice  analysis,  more  data  (longer  pitch  period)  are  available  for  the 
algorithm  to  converge  and  less  errors  are  obtained. 

Determination  of  the  Model  Order  for  Real  Speech  Analysis 

The  choice  of  the  order  of  an  AR  process  depends  on  the  number  of 
formants  to  be  extracted.  A vocal  tract  resonant  frequency  (formant) 
corresponds  to  a pair  of  complex  conjugate  poles.  For  example,  in 
order  to  obtain  five  formants  from  the  analysis  of  a speech  signal,  a 
minimum  order  of  10  is  required.  In  addition  to  the  10  poles,  at  least 
2 extra  poles  are  needed  to  account  for  the  effect  of  the  source 
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excitation.  Markel  and  Gray  [1976]  suggested  that  it  is  reasonable  to 
choose  the  model  order  for  voiced  speech  analysis  as  the  sampling 
frequency  in  kHz  (e.g.,  an  order  of  10  corresponding  to  a 10  KHz 
sampling  frequency)  with  the  addition  of  several  more  terms  (from  two 
to  five  depending  on  the  desired  results).  According  to  the  above 
general  rules,  possible  values  of  the  AR  model  order  for  voiced  speech 
analysis  are  from  ten  to  sixteen.  Experimental  results  (Figure  5.25) 
showed  that  12th  order  LP  coefficients  were  the  best  choice  for  an  AR 
model  in  terms  of  the  LP  polynomial  root-solving  and  formant  tracking 
for  the  speech  data  in  this  study.  Ve  chose  IP=12  and  IQ=6  for  the 
ARMA  model,  since  three  antiformants  were  assumed  in  the  analysis  of 
nasal  sounds. 

Discussion 

Our  results  have  demonstrated  the  superiority  of  the  closed  phase 
WRLS-VFF  method  of  analysis.  This  method  cannot  only  be  used  for  all- 
pole model  analysis  as  in  the  vowels  and  diphthongs  but  also  can  be 
applied  for  pole-zero  model  analysis  such  as  with  fricatives, 
consonants,  and  nasal  sounds.  The  CPC  method,  which  is  considered  to 
be  better  than  other  LP  methods,  does  not  work  well  when  the  closed 
phase  interval  is  short,  as  in  the  case  with  female  or  children's 
voices,  or  when  the  vocal  tract  characteristics  change  rapidly,  as  with 
the  vowel/consonant  transitions  and  some  glide  sounds.  This  is  because 
the  estimated  covariance  matrix,  obtained  from  a short  data  interval, 
may  give  a biased  estimate  of  the  filter  parameters.  Furthermore, 
matrix  ill-conditioning  problems  may  occur  when  solving  the  least 


squares  equations  in  the  CPC  method. 
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The  adaptive  recursive  algorithms  derived  from  a least  squares 
cost  function  are  known  to  converge  rapidly  (for  short  data  records) 
[Haykin,  1985]  and  have  an  excellent  capability  to  "track"  an  unknown 
parameter  vector.  In  the  proposed  VRLS-VFF  algorithm,  in  addition  to 
the  advantages  of  the  recursive  algorithm,  a variable  forgetting  factor 
is  used  to  allow  the  estimation  process  to  track  the  time-varying 
parameters  more  quickly.  However,  in  the  recursive  lattice  approach 
(WLSL) , the  ability  to  track  the  parameter  variations  degrades  due  to 
the  use  of  a fixed  weighting  factor.  Moreover,  in  the  lattice 

structure,  both  order  and  time  recursions  are  required  in  the 
adaptative  process.  This  results  in  an  increase  in  memory  size  and  is 
impractical  for  real  speech  applications. 

There  are  two  principle  inhibiting  factors  in  the  WRLS-VFF 
algorithm:  computational  complexity  (as  discussed  in  chapter  4)  and 
lack  of  a priori  information  of  the  model  type  and  model  order  for 
parameter  tracking  of  speech  signals.  The  first  of  these  is  being 
ameliorated  with  new  VLSI  technology,  specifically  the  advent  of 
single-chip  processors  such  as  AT&T  DSP32 , NEC  yP7720,  and  TI  TMS320. 
The  second  factor  usually  occurs  when  we  analyze  an  utterance  which 
contains  a succession  of  sounds,  such  as  vowel,  consonant,  and  nasal, 
as  in  the  case  of  "seven."  Some  preprocessing  techniques  such  as 
voiced/unvoiced/silent  classification,  phonetic  segmentation,  and 
feature  measurement  can  be  used  to  estimate  the  model  types  and  model 
order  before  executing  the  VRLS-VFF  algorithm. 
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Figure  5.7  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "five"  using  the  CPC  method. 
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Figure  5.8  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "five"  using  the  ULSL  method. 
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Figure  5.9  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "five"  using  the  VRLS-VFF  method. 
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Figure  5.10  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "seven"  using  the  CPC  method. 
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Figure  5.11  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "seven"  using  the  WLSL  method. 


FREQUENCY  (HZ) 


122 


3032  - 
43Z0  - 
400Q  — 
3500  — 
3000  - 
2500  - 
2300  -| 
1500  - 
1000  - 
500  - 

0 — 
0 


3 * 3 4 
3 

3 

2 2 2 2 

1111 
0 . 05 


2 2 


222 


1 1 1 1 1 


0 


10 


rORnPNT  TRACKS 


**  -4  -4 


•4  -a  -4  4 4 4 it 


3 3 3 


33  333  3333 


2222222  2222 

2 2 2 2 


ll!  11111111  j.111 

a. is  a . 2a  a. 25  a. 3a  a . 35 
TIME  (SECS) 


(a) 


(b) 

Figure  5.12  (a)  Formant  tracks,  (b)  spectral  plots  for  the 

utterance  "seven"  using  the  VRLS-VFF  (AR)  method. 
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Figure  5.13  (a)  Formant  tracks,  (b)  spectral  plots  for  the 

utterance  "seven"  using  the  VRLS-VFF  (ARMA)  method. 
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Figure  5.14  Formant  tracks  for  the  utterance  "we  were  away" 
using  the  closed  phase  WRLS-VFF  method. 


"we  were  away" 


Figure  5.15  Formant  tracks  for  the  utterance 
using  the  CPC  method. 
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Figure  5.16  Formant  tracks  for  the  utterance  "a  one  mile  lane" 
using  the  closed  phase  WRLS-VFF  (AR)  method. 


Figure  5.17  Formant  tracks  for  the  utterance  "a  one  mile  lane" 
using  the  CPC  method. 
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Figure  5.18  Formant  tracks  for  the  utterance  "a  one  mile  lane" 
using  the  closed  phase  WRLS-VFF  (ARMA)  method. 


Figure  5.19  Anti-formant  tracks  for  the  utterance  "a  one  mile 

lane"  using  the  closed  phase  VRLS-VFF  (ARMA)  method. 
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Figure  5.20  Formant  tracks  for  the  utterance  "Should  we  chase" 
using  the  closed  phase  VRLS-VFF  (AR)  method. 


Figure  5.21  Formant  tracks  for  the  utterance  "Should  we  chase" 
using  the  CPC  method. 
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Figure  5.22  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "five"  using  the  WLSL  method. 
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Figure  5.23  (a)  Formant  tracks,  (b)  spectral  plots  for  the 
utterance  "five"  using  the  WRLS-VFF  method. 
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Figure  5.24  (a)  Formant  tracks,  (b)  spectral  plots  for  the  utterance 
"Ve  were  away"  using  the  closed  WRLS-VFF  (AR)  method. 
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Figure  5.25  Formant  tracks  with  different  model  order  for  the 
utterance  "we  were  away"  using  the  LPC  method. 
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Glottal  Inverse  Filtering  by  the  WRLS-VFF  Method 
Introduction 

Voiced  speech  can  be  regarded  as  the  output  of  a linear  system, 
characterized  by  the  vocal  tract  resonances  and  excited  by  a determined 
glottal  wave.  An  important  issue  in  the  areas  of  speech  analysis  and 
synthesis  is  to  separate  the  speech  waveform  into  vocal  tract  and 
glottal  wave  components.  The  process  of  estimating  the  glottal 
waveform  by  removal  of  vocal  tract  resonances  from  speech  signals  is 
known  as  glottal  inverse  filtering  (GIF).  For  GIF  the  speech  signal  is 
analyzed  to  determine  the  acoustic  output  response  of  the  vocal  tract 
to  a glottal  flow  input.  The  response  is  used  to  specify  the 
coefficients  of  an  inverse  filter,  which  is  used  to  filter  the  acoustic 
speech  wave  to  yield  an  estimate  of  the  glottal  flow  wave.  The  inverse 
filter  serves  to  cancel  the  effect  of  vocal  track  resonances  from  the 
speech  signal  to  reveal  the  underlying  voicing  source,  namely,  the 
glottal  volume  velocity  (v-v). 

Estimation  of  the  glottal  v-v  waveform  has  applications  in  the 
areas  of  speech  analysis  and  synthesis.  In  standard  linear  predictive 
analysis  of  speech,  a quasi-periodic  impulse  train  is  assumed  to  be  the 
excitation  of  an  all  pole  system.  The  glottal  shaping  is  accounted  for 
by  the  inclusion  of  two  poles  in  the  system  (since  the  glottal 
excitation  has  an  approximate  -12  dB  roll-off).  The  result  is  that  the 
accuracy  of  the  derived  poles  representing  the  formants  can  be  affected 
[ Steigli tz  and  Dickinson,  1977].  Modeling  the  glottal  excitation 
separately  from  the  vocal  tract  system  allows  for  more  accurate  formant 
estimation  [Veeneman  and  BeMent,  1985].  It  has  also  been  established 
that  the  source  waveform  shape  used  to  drive  a vocoder  has  a 
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significant  influence  on  speech  quality  [Rosenberg,  1971],  and  the  use 
of  the  pulse  shape  derived  from  theoretical  analysis  of  the  glottal 
flov  wave  results  in  an  improvement  in  synthesis  quality  over  simple 
impulse  excitation  [Fant,  1979].  Another  application  is  in  the  study 
of  laryngeal  pathology.  The  flow  of  air  through  the  glottis  (i.e.,  the 
glottal  v-v),  reflects  the  action  of  the  vocal  folds  and  is  thus  an 
important  indicator  of  laryngeal  function  [Deller,  1982].  Any 
abnormalities  of  the  larynx  that  affects  the  vibrational  pattern  of  the 
vocal  folds  and  the  audible  quality  of  the  speech  will  be  evident  in 
the  glottal  v-v  waveform.  However,  the  resonances  created  by  the  vocal 
tract  make  the  degree  of  an  abnormality  difficult  to  quantify  in  the 
resultant  acoustic  speech  waveform.  The  estimation  of  the  glottal  v-v 
waveform  from  acoustic  speech  thus  plays  an  important  role  in 
pathological  voice  analysis. 

Analysis  Methods  Review 

The  concept  of  the  GIF  has  been  of  interest  to  many  investigators 
and  much  success  has  been  obtained  in  achieving  reasonable  estimates  of 
the  glottal  waveform  from  speech  [Berouti,  1976;  Wong  et  al.,  1979; 
Veeneman  and  BeMent,  1985;  Childers  and  Krishnamur thy , 1985; 
Milenkovic,  1986;  Lee,  1988].  We  can  roughly  categorize  these  methods 
into  two  groups,  namely,  two-pass  closed  phase  covariance  (CPC)  method 
[Wong  et  al.,  1979;  Lee,  1988],  and  two-channel  (speech  and  EGG)  CPC 
method  [Childers  and  Krishnamurthy , 1985;  Veeneman  and  BeMent,  1985]. 

In  the  two-pass  method,  Wong  et  al.  [1979]  suggested  a straightforward 
(but  computationally  expensive)  approach  for  analyzing  the  normalized 
linear  prediction  error  sequence.  This  was  done  by  calculating  the  p- 
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pole  total  linear  prediction  error  on  an  M-point  window  of  the  speech 
waveform.  The  window  is  moved  through  the  speech  waveform  one  point  at 
a time  and  after  energy  normalization  the  total-error  sequence 
represents  a measure  of  the  fit  of  a p-pole  model  to  segments  of  the 
waveform.  The  closed  phase  region  is  determined  where  the  total  error 
is  at  a minimum  (ideally,  zero).  Lee  [1988J  simplifies  the  procedure 
for  estimating  the  closed  phase  interval  by  locating  the  glottal 
closure  point  using  the  LP  residual  error.  In  the  two  channel  method, 
the  instant  of  glottal  closure  and  the  duration  of  the  closed  glottal 
phase  are  determined  using  the  EGG  signal.  After  the  glottal  closed 
phase  has  been  determined,  the  vocal  tract  filter  coefficients  are 
obtained  by  using  the  covariance  analysis  technique.  Finally,  by 
inverse  filtering  the  speech  signal  with  a proper  cancellation  of  the 
radiation  factor,  we  obtain  the  glottal  v-v. 

Wong's  method  is  sufficient  for  low  frequency  normal  speech  with 
a well  defined  closed  phase  [Veeneman  and  BeMent,  1985].  However,  in 
cases  of  high  frequency  or  breathy  speech  where  the  closed  phase  is  of 
shorter  duration,  this  method  may  not  provide  unambiguous  local  minimal 
ranges  of  the  normalized  error  sequence  that  are  necessary  to  indicate 
closed  phase.  This  problem  occurs  when  the  duration  of  the  closed 
phase  is  less  than  the  length  of  the  analysis  window.  In  Lee's  method 
[1988],  sometimes  the  "pseudo  closed  phase"  does  not  exactly  match  the 
real  closed  phase  and  this  may  result  in  biased  estimates  using  the 
covariance  method.  The  two-channel  method  can  provide  better  results 
but  fails  when  the  speech  signals  contain  an  incomplete  or  no  closed 
phase.  This  fact  will  be  seen  in  the  experimental  results  in  this 


study. 
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Glottal  Inverse  Filtering  by  the  URLS  Method 

The  basic  idea  of  the  WRLS-VFF  method  for  GIF  is  to  use  the 
variable  FF  to  identify  the  glottal  closure  points,  which  correspond 
to  the  instants  of  the  main  excitation  pulses  for  voiced  speech.  As 
can  be  seen  from  Figure  5.4,  the  smallest  value  of  A^  in  the  WRLS-VFF 
method  consistently  matches  the  negative  peaks  of  the  DEGG  signal, 
which  are  located  very  close  to  the  instant  of  glottal  closure 
[Childers  and  Krishnamurthy , 1985]. 

A block  diagram  of  the  WRLS-VFF  method  for  GIF  is  shown  in  Figure 
5.26.  During  the  analysis,  the  A^  can  be  obtained  sequentially 
(sample-by-sample).  The  location  of  the  excitation  pulses  and  pitch 
periods  were  detected  by  using  a threshold  value  of  Aq  and  comparing 
with  each  A^.  The  Aq  can  be  determined  empirically.  The  value  used  in 
this  study  was  0.1.  The  time  to  store  the  filter  coefficients  for 
analysis  is  determined  when  the  adaptative  process  converges.  The 
minimal  data  length  for  convergence  of  the  WRLS  algorithm  is  equal  to 
twice  the  filter  order  [Haykin,  1985].  The  time  of  convergence  can 
also  be  determined  by  checking  the  estimation  error  or  the  variable 
forgetting  factor. 

The  formant  resonances  of  the  vocal  tract  were  estimated  by 
solving  the  roots  of  the  LP  polynomial,  and  then  shaping  the  formant 
structure  by  empirical  rules,  which  included:  1)  discarding  the  roots 
with  center  frequencies  under  250Hz,  2)  discarding  the  roots  with  Q 
less  than  one,  and  3)  merging  two  adjacent  roots.  The  refined  formant 
resonances  were  then  used  to  construct  the  vocal  tract  transfer 
function,  which  was  used  in  the  final  GIF  procedure.  The  direct  output 
of  the  GIF  operation  is  a differential  glottal  v-v  waveform,  g'n.  A 
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Figure  5.26  Block  diagram  of  the  VRLS-VFF  method  for  glottal 
inverse  filtering  analysis. 
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glottal  v-v  waveform,  gn,  is  derived  by  carrying  out  an  integration  on 
g'n  to  cancel  out  the  effect  of  the  lip  radiation. 

Comparison  of  Different  GIF  Methods 

The  performance  of  three  GIF  methods  (two-pass,  two-channel  and 
VRLS-VFF)  were  evaluated  by  analyzing  different  types  of  speech  signals 
such  as  modal  voice  (medium  pitch)  and  falsetto  voice  (high  pitch). 
The  characteristics  of  the  test  speech  signals  are  listed  in  Table  5.1. 
The  speech  and  EGG  signals  produced  by  phonating  the  sustained  vowel 
/a/  and  /i/  from  a male  speaker  DMH  and  a female  speaker  DDH  are  shown 
in  Figures  5.27  (modal  voice),  5.28  (falsetto  voice),  5.29  (modal 
voice),  and  5.30  (pathological  voice),  respectively. 

For  the  modal  voice  a closed  phase  interval  was  present  and  all 
the  three  methods  were  able  to  extract  the  glottal  v-v  waveform  from 
the  speech.  Figures  5.31,  5.32,  and  5.33  show  the  estimated  glottal  v- 
v waveform,  gn,  and  its  derivative  (gn')  for  the  two-pass,  two-channel, 
and  VRLS-VFF  methods,  respectively.  However,  for  the  two-pass  method, 
some  high  frequency  components  appear  in  the  closed  phase  region, 
presumably  due  to  an  incomplete  cancellation  of  the  vocal  tract 
transfer  function.  For  the  two-channel  method,  a smoothed  glottal  v-v 
can  be  obtained,  while  there  are  a few  intervals  where  this  method 
failed.  Experimental  data  show  that  the  estimated  bandwidths  in  these 
intervals  are  too  narrow  to  construct  an  accurate  vocal  tract  transfer 
function  which  is  essential  for  GIF.  For  the  VRLS-VFF  method,  a 
continuous  glottal  v-v,  synchronous  with  the  EGG  signal,  can  be 
These  glottal  v-v  waveforms  agree  with  the  expected 


obtained . 
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Table  5.1  Test  speech  data  /a/  and  / i / . 


Speech 

Types 

Speech 

Data 

EGG 

Data 

Pitch 

Period 

Closed 

Phase 

Speaker 

Modal  Voice 
/a/ 

KMDS1 

KMDE1 

7 . 9 ms 

3.4  ms 

DMH 

(male) 

Falsetto 

/a/ 

KFLS1 

KFLE1 

6.4  ms 

NO 

DMH 

(male) 

Modal  Voice 

/i/ 

DHS3 

DHE3 

7.2  ms 

4.0  ms 

DMH 

(male) 

Pathological 
Voice  / i / 

JJS3 

JJE3 

4.8  ms 

NO 

JJS 

( female) 
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characteristics  for  the  glottal  excitation  source  such  as  a flat  closed 
region  and  a sharp  slope  at  closure. 

For  falsetto  voice  analysis,  due  to  the  high  pitch  and  incomplete 
glottal  closure,  the  two-channel  method  failed  to  analyze  the  signal. 
The  two-pass  method  can  only  estimate  the  glottal  v-v  waveform  for  a 
few  intervals.  Figure  5.34  shows  the  estimates  glottal  v-v  and  gn' 
using  the  two-pass  method.  The  WRLS-VFF  method,  as  long  as  the  pitch 
period  is  longer  than  twice  the  filter  order,  can  give  a good  estimate 
of  the  glottal  v-v  waveform.  This  can  be  seen  from  Figure  5.35.  The 
two-pass  and  WRLS-VFF  methods  have  also  been  tested  on  other  speech 
signals  such  as  modal  and  pathological  voice  /i/.  The  results  are 
shown  in  Figures  5.36,  5.37,  and  5.38,  respectively.  A summary  of  the 
experimental  results  for  different  GIF  analysis  methods  is  given  in 
Table  5.2. 

Discussion 

A WRLS-VFF  method  has  been  presented  in  this  section  to  estimate 
the  glottal  v-v  waveform  by  the  inverse  filtering  of  the  acoustic 
speech  signal.  This  method  does  not  require  an  auxiliary  EGG  signal  as 
in  the  case  of  the  two-channel  method  nor  does  it  need  another  pass  to 
locate  the  position  of  the  excitation  pulses  as  in  the  case  of  the  two- 
pass  method.  It  can  provide  reliable  glottal  v-v  waveform  estimates 
automatically  from  the  speech  signal.  Experimental  results  show  that 
this  WRLS-VFF  method  is  capable  of  estimating  the  glottal  v-v  waveform 
with  high  accuracy  for  high  pitch  speech  as  well  as  speech  without  a 
glottal  closed  phase.  Aside  from  the  superiority  of  this  method,  some 
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Table  5.2  Comparison  of  different  GIF  analysis  methods 


Algori thms 

Two-channel 

Two-pass 

WRLS-VFF 

Accuracy  of 

Formant 

Estimation 

depends  on 
the  closed 
phase 

depends  on 
the  "pseudo 
closed  phase" 

very  good 

Computational 

Load 

low 

medium 

high 

Stability 

data 

dependent 

data 

dependent 

no  stability 
problems 

High  Pitch  or 
Incomplete 
Closed  Phase 

failed 

sometimes 

failed 

did  not 
fail 

Convenience 

needs 
auxiliary 
signal  (EGG) 

easy  to 
implement 

easy  to 
implement 

Execution 

batch 

processing 

batch 

processing 

sequential 

processing 
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Figure  5.27  Speech  and  EGG  waveforms  for  the  modal  voice 
/a/  spoken  by  DMH. 


Figure  5.28  Speech  and  EGG  waveforms  for  the  falsetto  voice 
/a/  spoken  by  DMH. 
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Figure  5.29  Speech  and  EGG  waveforms  for  the  modal  voice 
/i/  spoken  by  DMH. 


Figure  5.30  Speech  and  EGG  waveforms  for  the  pathological 
voice  /i/  spoken  by  JJS. 
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Figure  5.31  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 
v-v  waveform  (lower)  for  the  modal  voice  /a/  using  the 
two-pass  method. 


Figure  5.32  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 

v-v  waveform  (lower)  for  the  modal  voice  /a/  using  the 
two-channel  method. 
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Figure  5.33  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 

v-v  waveform  (lower)  for  the  modal  voice  /a/  using  the 
VRLS-VFF  method. 
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Figure  5.34  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 
v-v  waveform  (lower)  for  the  falsetto  /a/  using  the 
two-pass  method. 


Figure  5.35  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 
v-v  waveform  (lower)  for  the  falsetto  /a/  using  the 
WRLS-VFF  method. 
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Figure  5.36  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 

v-v  waveform  (lower)  for  the  modal  voice  /i/  using  the 
two-pass  method. 


Figure  5.37  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 

v-v  waveform  (lower)  for  the  modal  voice  /i/  usine  the 
WRLS-VFF  method. 
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Figure  5.38  Estimated  glottal  v-v  waveform  (upper)  and  D-glottal 
v-v  waveform  (lower)  for  the  pathological  voice  /i/ 
using  the  VRLS-VFF  method. 
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practical  factors  need  to  be  considered  when  executing  the  GIF 
analysis . 

Filter  Order  (IP).  Our  experimental  results  showed  that,  for  a 
voiced  speech  signal  with  five  formant  resonances,  the  acceptable 
minimum  IP  was  12.  Thus,  in  addition  to  the  10  poles  required  for  the 
simulation  of  the  formant  resonances,  at  least  2 extra  poles  were 
needed  to  account  for  the  effect  of  the  source  excitation.  The  optimal 
IP,  which  was  unknown  before  the  inverse  filtering  analysis,  was 
dependent  on  the  characteristics  of  the  actual  excitation  wave.  Two 
extra  poles  were  found  to  be  enough  to  account  for  an  excitation  wave 
with  a spectral  roll-off  rate  not  higher  than  12  dB/octave.  A higher 
order  of  LP  filter  did  not  improve  the  performance. 

Preemphasis  of  the  Speech.  The  speech  was  preemphasized  using  a 
high-pass  filter.  Theoretically,  a true  closed  phase  analysis  does  not 
require  preemphasis  [Krishnamurthy , 1983].  However,  in  practical 
situations  the  position  to  update  the  filter  parameters  for  analysis  in 
the  WRLS-VFF  method  may  not  be  within  the  closed  phase.  Experimental 
results  in  this  study  showed  that  the  use  of  a preemphasis  filter  often 
reduces  the  estimation  error,  especially  in  the  first  formant/bandwidth 
estimation. 

Speech  Signal  Recording.  Lee  [1988]  pointed  out  that  it  is 
necessary  to  collect  the  speech  signal  without  magnitude  and  phase 
distortion  throughout  the  frequencies  of  interest  (DC  to  1 KHz)  for 
estimating  a natural  glottal  v-v  waveform.  In  our  experiments  of 
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formal  inverse  filtering  analysis,  only  the  speech  signals  recorded  by 
a condenser  microphone  (B&K  4113)  with  a good  low-frequency  response 
followed  by  a direct  digital  conversion  were  used. 

An  ARMA  Cepstral  Distance  Measure  Based  on  the  VRLS-VFF  Algorithm 
Introduction 

An  important  aspect  in  parametric  modeling  of  speech  is  the 
criteria  for  judging  the  quality  of  representation  for  a particular  set 
of  parameters.  Distance  (also  called  distortion)  measures,  based  upon 
such  parameter  sets  as  filter  coefficients,  reflection  coefficients, 
and  cepstral  coefficients  which  retain  the  smoothed  spectral  behavior 
of  the  speech  signal,  have  been  applied  in  speech  recognition  [Itakura, 
1975;  Nocerino  et  al.,  1985],  speaker  identification  and  verification 
tasks  [Atal,  1974;  Wakita,  1976].  The  current  available  objective 
distance  measures  such  as  Itakura-Saito  distance  [Itakura  and  Saito, 
1968],  log  likelihood  ratio  distance  [Gray  and  Markel,  1976],  and  LPC 
cepstral  distance  [Atal,  1974]  have  been  based  on  the  linear  prediction 
theory,  which  assumes  that  the  speech  production  process  is  an  all-pole 
model.  These  measures  compare  two  speech  segments  in  terms  of  their 
respective  spectra.  The  spectra  are  derived  from  the  corresponding 
sets  of  LPC  (AR)  coefficients.  In  some  cases,  the  speech  production 
process  might  be  considered  as  an  ARMA  process  (e.g.,  nasalized  sounds 
or  noise  corrupted  vowels  [Done  and  Rushforth,  1979]).  These  distance 
measures  based  on  the  AR  model  coefficients  may  not  give  an  accurate 
representation  of  the  spectral  distance  of  the  analyzed  signals.  In 
this  section,  an  ARMA  cepstral  distance  is  derived  from  the  WRLS-VFF 
algorithm.  This  ARMA  cepstral  distance  measure  and  a corresponding  AR 
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cepstral  distance  measure  were  tested  and  compared  for  the  synthetic 
speech  signal  SYN5 . 

Properties  of  the  Spectral  Distance  Measure 

Let  d(f,f)  denote  the  distance  measure  between  two  segments  of 
speech  with  spectra  f and  f.  Then  a good  distortion  measure  should 
have  the  following  properties: 

(1)  d(f,f)  is  non-negative; 

(2)  d(f,f)  should  have  a reasonable  perceptual  interpretation; 

(3)  should  be  mathematically  tractable  and  easy  to  compute. 

The  first  property  (i.e.,  positive  definiteness),  is  a 
requirement  for  a valid  distance  measure.  The  second  property  is 
important  because  it  links  the  signal  processing  to  the  human  speech 
perception  process.  The  third  property  is  a practical  one  for 
implementation  in  a computation  bound  system  such  as  a speech 
recognizer. 

Spectral  Distance  Measure  and  RMS  Log  Spectral  Distance  Measure 

Consider  two  spectra  S(eJ®)  and  S(ed®)  of  two  segments  of  speech 
s(n)  and  s(n)  respectively.  The  error  or  difference  between  these 
spectra  on  a log  magnitude  versus  frequency  scale  is  defined  by 

V( 9)  = In  |S(eJ e)  p _ in  |s(eJ  e|2  (5.14) 

where  0 is  a normalized  frequency  or  angle  in  the  z plane. 

A generalized  frequency  domain  distance  measure  between  the 
spectra  is  the  set  of  Lp  norms  or  measures  defined  by  dp,  where 
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n 

dp  = ( 1/2 it)  J |V(0)  |P  d9 
-n 


(5.15) 


For  p=2  the  RMS  log  spectral  distance  measure  is  defined.  For  large 
values  of  p,  large  distances  are  weighted  heavier  than  small  distances. 
For  the  limiting  case  as  p approaches  infinity,  the  peak  log  spectral 
distance  is  obtained. 


Parametric  Distance  Measure 

A parametric  distance  measure  is  useful  when  comparing  the 
performance  of  a parameter  estimation  algorithm  to  a reference  set  of 
parameters.  If  pi  and  ^ denote  the  ith  reference  and  ith  estimated 
parameter,  respectively,  then  the  Euclidean  distance  between  these  two 
parameter  sets  may  be  expressed  as 

N 

d(N)  = £ (Pi  - Pi)2  (5.16) 

i = l 

where  the  N denotes  the  number  of  truncated  measures. 

Markel  and  Gray  [1976]  have  shown  that  this  truncated  parametric 
distance  measure  computed  using  the  cepstral  coefficients  is  an  accurate 
estimate  of  the  L2  norm  log  spectral  distance.  Their  results  showed 
that  for  an  AR(10),  AR(20),  and  AR(30)  analysis,  the  correlation 

coefficient  between  the  L2  log  spectral  distance  measure  (d2)  and 
truncated  cepstral  distance  measure  (dcep)  was  .98,  .997,  .999 

respectively.  This  indicates  that  the  first  20-30  cepstral  coefficients 
may  be  used  to  approximate  the  L2  norm  at  a considerable  savings  in 
computation. 
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For  the  pole/zero  estimation  of  the  speech  production  system,  the 
RMS  log  spectrum  obtained  from  the  ARMA  estimation  algorithm  can  also  be 
approximated  by  the  cepstral  coefficients,  provided  the  estimated  poles 
and  zeros  are  within  the  unit  circle  [Shoie  and  Wygonshi,  1987].  The 
following  section  will  discuss  how  to  derive  a set  of  cepstral 
coefficients  from  the  estimated  ARMA  parameters. 

Coefficients  Transformation 

The  AR,  ARMA,  and  cepstral  parameter  estimation  algorithms  yield  a 
set  of  parameters  that  model  the  spectrum  of  the  speech  segment.  The 
cepstral  parameters  used  in  the  (5.16)  can  also  be  transformed  from  AR 
(LPC)  or  ARMA  parame  ters  recursively. 


AR  to  cepstral  coefficients  transformation.  Assuming  the  speech 
production  process  can  be  expressed  as  an  AR  model 

P 

sn  1 ak  sn-k  + G un  (5.17) 

k = l 


The  set  of  cepstral  parameters  c0,  cj,...,cn  may  be  computed  from 
the  set  of  LPC  coefficients  by  the  relations  [Markel  and  Gray,  1976] 


c0  = In 
ci  = -ai 


n-1 

cn  = (1  - k/n)  ak  cn-k  - an>  1 < n < p 


c„  = 


P 

-E 

k=l 


(1  - k/n)  ak  cn_k, 


p < n 


(5.18) 


153 


ARMA  to  cepstral  coefficients  transformation.  A speech  signal  sn 
generated  from  a speech  production  model  (SPM)  is  said  to  be  an  ARMA 

process  of  order  (p,q)  if  it  can  be  represented  by  the  difference 
equation 

P q 

sn  = " z *i  sn_i  + I dj  un_-j  (5.19) 

i=l  j=0 

where  uj  is  a stationary  white  noise  process  with  zero  mean  and  variance 
a2.  The  transfer  function  H(z)  of  the  SPM  is  then  given  by 

H(z)  = G B(z)  / A(z) 

9 P 

= G (1  + I bj  z J)  / (1  + E aj  z-1)  (5.20) 

i=l  i=l 

where  G is  the  gain  and  bj  is  the  normalized  coefficients  of  the  MA 
filter,  namely,  G=b0,  bi=d^/b0,  etc.  If  all  the  poles  of  H(z)  are 
inside  the  unit  circle,  then  the  Fourier  series  expansion  for  the  model 
spectrum  can  be  written  as 

In  |H(z)|2  = In  G2  + 2 In  |B(z)/A(z)  | 

00 

= E ck  z-k 
_ 00 

00 

= co  + 2 z ck  z~k  (5.21) 

k=l 

To  obtain  the  relationship  between  cepstral  coefficients  cn  and  ARMA 
parameters  ak,  bk,  differentiating  both  sides  of  (5.21)  with  respect  to 
z_1  gives 
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dz 


-1 


1 + E bk  z“ 
k=l 


d ® 


In 


p dz-1  n=l 

1 + E ak  z_1 
k=l 


E cn  z 


-n 


(5.22) 


which  is  simplified  to 


Ik  bk  zk+1 
k=l 


1 + E bk  z 
k=l 


-k 


E k ak  z k+l 
k=l 


= E n cn  z-1 


p n=l 

1 + E ak  z_k 
k=l 


(5.23) 


and  rewritten  as 


q p q p 

( E k bk  z~k+1)  (1  + E ak  z-k)  - (1  + E bk  z~k)  ( E k ak  z"k+1) 
k=l  k=l  k=l  k=l 


® q p 

= ( E n cn  z_n+1)  (1  + E bk  z"k)  (1  + E ak  z“k)  (5.24) 


n=l 


k=l 


k=l 


If  we  now  equate  the  constant  term  and  the  various  powers  of  z~ * 
on  the  left  and  right  hand  sides  of  (5.24),  we  obtain  the  desired 
relationship  between  cn's  and  an's,  bn's,  namely 

In  G2 

P 

-an  - 2 (1  - k/n)  ak  cn_k 

k=l 

q 

- E (1  - 2j/n)  b-:  an_-: 

j = l 

q p 

- E E (1  - k/n  - j/n)  ak  b-j  cn  k ^ 

j=l  k=0  3 ^ 


(5.25) 
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In  this  expression  it  is  assumed  that  a0=l,  a{^=0  outside  the  range  l...p 
and  bj=0  outside  the  range  l...q. 

Performance  Evaluation  of  the  ARMA  Cepstral  Distance  Measure 

The  criteria  to  evaluate  a cepstral  distance  should  satisfy  the 
following  conditions: 

(a)  The  smoothed  spectrum  obtained  from  the  truncated  cepstral 
coefficients  should  have  the  same  spectral  shape  (poles  and  zeros)  as 
the  log  magnitude  spectrum  of  the  original  signal. 

(b)  The  correlation  coefficient  between  the  truncated  ARMA 
cepstral  distance  measure  and  the  L2  log  spectral  distance  measure 
should  be  close  to  unity.  This  means  we  can  use  the  ARMA  cepstral 
distance  measure  to  represent  the  spectral  property  instead  of  using 
FFTs  and  logarithms  to  compute  the  signal  spectrum  [Gray  and  Markel, 
1976] . 

The  proposed  ARMA  distance  measure  was  evaluated  and  compared 
based  on  these  two  conditions.  A synthetic  speech  signal,  /m/,  ( SYN3) , 
which  consists  of  3 poles  and  one  zero,  was  used  to  evaluate  the 
performance  of  the  AR  and  ARMA  cepstral  coefficients  transformation.  In 
order  to  calculate  the  correlation  coefficients  for  the  ARMA  cepstrum,  a 
synthetic  speech  signal  SYN3A  with  slightly  different  formants  and 
antiformants  from  SYN3  (see  Figure  5.39)  were  also  generated. 

The  following  test  procedures  were  used  in  the  performance 
evaluation: 

(1)  The  WRLS-VFF  algorithm  was  used  to  estimate  the  ARMA  (6,2) 
parameters  (aif  bi)  from  the  synthetic  speech  SYN3. 

(2)  The  proposed  ARMA  to  cepstral  coefficients  transformation  in 
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equation  (5.25)  was  used  to  calculate  the  cepstral  coefficients  for 
N=12,  24,  and  36  respectively. 

(3)  The  impulse  response  and  the  smoothed  spectra  were  calculated 
from  the  truncated  cn. 

The  same  procedures  were  executed  for  the  AR  case  while  using  the 
LPC  method  (p=12  and  p=16)  and  equation  (5.18).  The  reason  for  using 
high  order  LPC  filter  is  to  approximate  the  zeros  in  the  signal 
spectrum. 

Experimental  Results 

(a)  Figure  5.40  shows  the  original,  AR,  and  ARMA  spectra  for  the 
synthetic  signal  SYN3,  respectively. 

(b)  Figures  5.41  and  5.42  show  the  smoothed  spectra  obtained  from 
the  truncated  c^  based  on  the  AR(12)  and  AR(16)  to  cepstral  coefficients 
transformation  as  in  equation  (5.18). 

(c)  Figure  5.43  shows  the  smoothed  spectra  obtained  from  the 
truncated  cn  based  on  the  ARMA  to  cepstral  coefficients  transformation 
as  in  equation  equation  (5.25). 

(d)  Table  5.3  indicates  the  correlation  coefficients  between  the 
RMS  log  spectral  distance  (d2)  and  the  ARMA  cepstral  distance. 

Discussion 

From  the  experimental  results,  the  proposed  ARMA  cepstral  distance 
measure  can  give  a better  representation  of  the  original  spectrum  than 
that  of  an  AR  cepstral  distance  measure  when  the  analyzed  signal  is 
generated  by  a pole-zero  model.  The  cepstral  coefficients  can  be 
recursively  obtained  from  ARMA  parameters  which  are  estimated  from  the 
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Table  5.3  ARMA(6,2)  distance  measure,  ARMA  log 
RMS  spectral  distance  (d2)  = 2.80  dB. 


N 
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16 

24 

32 

40 
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.94 

.98 
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Figure  5.39  (a)  Spectrum  of  the  synthetic  signal  SYN3. 

(b)  Spectrum  of  the  synthetic  signal  SYN3A. 
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(a) 


re  5.40  (a)  Original  spectrum,  (b)  estimated  spectrum  with 

ARMA(6 , 2) , and  (c)  estimated  spectrum  with  AR(12)  of 
the  synthetic  signal  SYN3. 


PR  SPECTRUM  FROM  c C n 3 


Figure  5.41  Smoothed  spectra  obtained  from  the  trancated  cn 
with  AR ( 1 2 ) , (a)  N=16,  (b)  N=24,  and  (c)  N=36. 
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Figure  5.42  Smoothed  spectra  obtained  from  the  trancated  cn 
with  AR( 16) , (a)  N=16,  (b)  N=24,  and  (c)  N=36. 


ORMO  SPECTRUM  FROM  c C m ) 


Figure  5.43  Smoothed  spectra  obtained  from  the  trancated  cn 
with  ARMA(6, 2) , (a)  N=16,  (b)  N=24  and  (c)  N=36. 
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WRLS-VFF  algorithm.  Furthermore,  the  correlation  coefficients  between 
the  truncated  ARMA  distance  measure  and  the  L2  log  spectrum  are  close  to 
unity  with  large  N (see  Table  5.3).  This  indicates  that  with  a finite 
set  of  ARMA  cepstral  coefficients,  we  can  represent  the  L2  log  spectrum 
accurately. 

The  cepstral  coefficients  used  for  speaker  identification, 
verification,  or  gender  recognition  have  the  additional  advantage  that 
one  can  derive  from  them  a set  of  parameters  that  are  invariant  to  fixed 
frequency-response  distortions  introduced  by  the  recording  apparatus  or 
the  transmission  system  [Atal,  1974].  The  new  parameters  are  obtained 
simply  by  subtracting,  from  the  cepstral  coefficients,  a set  of  values 
representing  their  time  averages  over  the  duration  of  the  entire 
utterance.  If  the  frequency  response  of  the  microphone  and  the 
transmission  system  both  have  poles  and  zeros,  the  proposed  WRLS-VFF 
method  plus  the  ARMA  cepstral  distance  measure  will  give  robust  results 
for  the  applications  mentioned  above. 


CHAPTER  6 


CONCLUSIONS  AND  DISCUSSIONS 
Summary 

A new  weighted  recursive  least  squares  method  with  a variable 
forgetting  factor  (VRLS-VFF)  for  ARMA  parameter  estimation  was 
developed.  This  method  was  derived  using  the  least  squares  approach 

and  incorporated  the  adaptive  system  identification  techniques.  A 
recursive  equation  to  update  the  forgetting  factor  was  also  derived  by 
using  the  weighted  sum  of  the  squares  of  residual  errors.  This  VFF  was 
shown  to  control  the  effective  memory  length  for  the  adaptive  process 
and  provide  information  to  estimate  the  input  driving  source,  which  was 
either  white  noise  or  a sequence  of  pulses.  The  performance  of  the 
WRLS-VFF  algorithm  has  been  tested  using  synthetic  speech  signals. 
Experimental  results  showed  that  the  accuracy  of  the  estimated  signal 
parameters  (AR  and  ARMA  models)  and  the  ability  to  track  the  parameter 
variations  for  nonstationary  processes  were  superior  to  both  block- 
processing techniques  such  as  linear  predictive,  interative 
prefiltering,  and  modified  Yule-Valker  equation  methods  and  other 
recursive  methods,  such  as  Kalman  filtering,  standard  WRLS,  and 
weighted  least  squares  lattice.  Other  outcomes  of  this  study  were  the 
application  of  the  WRL-VFF  algorithm  for  speech  signal  analysis  (e.g., 
spectral  (formant  and  antiformant)  tracking,  glottal  inverse  filtering 
and  ARMA  distance  measures). 
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Formant  and  antiformant  tracking  for  speech.  A closed  phase 
WRLS-VFF  method  was  developed  to  track  the  spectral  peaks  (formants)  as 
well  as  spectral  valleys  (antiformants)  for  different  kinds  of  speech 
signals  such  as  vowels,  diphthongs,  nasals  and  consonants  in  either 
isolated  words  or  sentences.  This  method  is  better  than  other  formant 
tracking  methods  in  terms  of  the  ability  to  track  the  formant 
variations  and  the  smoothness  of  the  estimated  bandwidth  tracks. 
Moreover,  the  WRLS-VFF  (ARMA)  algorithm  can  also  track  the  antiformants 
when  analyzing  nasal  sounds,  which  most  of  the  existing  formant 
tracking  techniques  cannot  provide. 

Glottal  inverse  filtering.  A new  glottal  inverse  filtering 
technique  using  the  WRLS-VFF  algorithm  was  proposed.  This  method  uses 
the  variable  forgetting  factor  and  the  estimation  error  to  select  the 
parameter  updating  position  that  excludes  the  region  of  the  main 
excitation  pulse.  This  results  in  an  increase  in  the  accuracy  of  the 
estimated  vocal  tract  transfer  function  and,  in  turn,  the  inverse 
filtered  glottal  volume-velocity  (v-v)  waveform.  The  proposed 
algorithm  was  compared  with  the  two-channel  (speech  and  EGG)  method 
and  the  two-pass  method  for  analyzing  different  types  of  speech 
signals.  Results  showed  that  the  proposed  method  was  capable  of 
estimating  the  glottal  v-v  waveform  both  for  the  high  pitch  speech 
signal  and  for  the  speech  signal  without  a completed  glottal  closed 
phase  (e.g.,  pathological  voices). 

ARMA  distance  measures.  A recursive  formula  for  the  transfer  of 
ARMA  coefficients  to  a set  of  cepstral  coefficients  was  derived.  The 
ARMA  coefficients  were  obtained  from  the  WRLS-VFF  method.  This  set  of 
cepstral  coefficients  can  be  used  for  an  ARMA  distance  measure  which 
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has  a better  physical  representation  of  the  spectral  properties  than 
the  AR  (LPC)  distance  measure.  Simulation  results  showed  that  this 
ARMA  distance  measure  was  close  to  the  log  magnitude  distance  measure 
with  a large  N,  where  N is  the  number  of  cepstral  coefficients.  This 
ARMA  distance  measure  can  be  used  for  many  applications  where  the 
signal  spectral  valleys  should  be  taken  into  consideration. 

Directions  of  Future  Research 

The  adaptive  estimation  of  time-varying  signal  parameters, 
especially  under  a nonstationary  environment,  is  important  for  many 
applications.  The  method  derived  in  this  study  was  applied 
successfully  to  speech  signal  analysis.  However,  there  are  still  some 
unsolved  problems  which  limit  the  performance  of  the  proposed 
algorithm.  Future  research  is  suggested  in  several  areas. 

(1)  An  important  factor,  which  affects  the  accuracy  of  the 
estimated  parameters,  is  the  selection  of  the  model  type  (AR  or  ARMA) 
and  model  order.  In  this  study  preprocessing  and  historical 

information  of  the  speech  signal  were  used  to  determine  the  model  type 
and  model  order.  These  parameters  remained  unchanged  during  the 
process  of  adaptation.  However,  in  many  situations  the  model  type  and 
model  order  changes  from  time  to  time  during  the  process  of  speech 
production.  Therefore,  the  process  of  selecting  and  updating  these  two 
parameters  when  they  are  changing  is  still  an  unsolved  problem. 
Techniques  such  as  feature  extraction,  phonetic  segmentation,  and  four 
way  (silent/voiced/unvoiced/mixed  excitation)  classification  [Childers 
et  al.,  1989]  are  suggested  to  be  used  in  the  selection  of  the  model 
type  and  model  order  during  the  adaptive  process. 
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(2)  The  main  disadvantage  of  the  proposed  method  is  the  high 
computational  load.  The  adaptive  parameter  estimation  method  based  on 
the  least  squares  lattice  ( LSL)  [Lee  et  al.f  1982]  is  rapid 
convergence,  robustness,  and  computational  efficiency.  However,  the 
weighting  (forgetting)  factor  used  in  this  algorithm  is  kept  constant. 
One  way  to  improve  the  performance  of  an  adaptive  ARMA  parameter 
estimation  method  is  to  apply  the  variable  forgetting  concept  to  the 
LSL  algorithm,  which  needs  to  be  developed. 

(3)  In  speech  signal  analysis,  the  ability  to  extract  an  accurate 
glottal  volume-velocity  (v-v)  waveform  can  serve  a valuable  function  in 
the  study  of  the  excitation  model  for  speech  synthesis  systems.  The 
glottal  v-v  obtained  from  the  WRLS-VFF  method  is  suggested  as  a method 
to  verify  various  glottal  excitation  models,  such  as  the  Fant,  LF,  and 
Fuj isaki-Ljungqvist  [1986]  models  for  different  kinds  of  speech 
signals,  especially  in  the  analysis  of  female  or  nasal  voices,  where 
most  of  the  existing  methods  do  not  work  well. 

(4)  Spectral  distance  (distortion)  measures  are  commonly  used  to 
measure  the  similarity  (spectral  matching)  between  two  given  short  time 
spectra.  Distance  measures  such  as  the  I takura-Sai to  distance  measure, 
the  log  likelihood  distance  measure,  and  cepstral  distance  measure  have 
been  applied  to  gender  recognition,  speaker  verification,  and  speech 
recognition  systems.  However,  all  of  these  distance  measures  are  based 
on  the  linear  predictive  coding  method,  which  is  an  all-pole  model.  It 
is  suggested  that  an  ARMA  distance  measure  may  be  used  to  improve  the 
performance  of  those  applications,  especially  when  spectral  zeros  exist 
in  the  underlying  signals. 


APPENDIX  A 


DERIVATION  OF  THE  WRLS  ALGORITHM  FOR  ARMA  PARAMETER  ESTIMATION 


To  derive  the  recursive  least  squares  algorithm  for  estimating 
the  ARMA  parameters,  we  use  the  same  procedure  as  in  the  derivation  of 
an  AR  VRLS  algorithm  [Covan  and  Grant,  1985].  The  same  result  can  be 
found  in  [Friedlander,  1982;  Miyanaga  et  al.,  1982].  Let  us  consider 
the  situation  where  we  make  just  one  extra  observation  at  the 
instant  yjc_j,  then  from  (3.11)  we  have 

*k  ^k  Yk  = x *k-l  \-l  Yk-1  + ^k  yk  (Al) 

Defining 

Pk  = [*k  \ VH  (A2) 

we  have 


pk  - tX  Pfc.i  1 + ^ <f>kt]  1 (A3) 

The  matrix  inverse  in  (A3)  can  be  written  as 

pk  = X_1  tpk-l  + pk-l  *k  (x  + <f)kt  pk-l  <)’k)_1  ^ pk-l J (A4) 

This  equation  follows  from  the  matrix  inverse  identity 

(A  + BCD)'1  = A-1  - A_1B(C  + DA-1B)-1DA-1  (A5) 

which  holds  for  all  matrices  A,  B,  C,  and  D of  compatible  dimensions 
(for  nonsingularity  of  A).  Define 


Kk  = pk  h 

= X_1  [Pk-l  + Pk-l  ♦k-l  (x  + ^ Pk-l  *k)-1  V pk-li  +k 
= x 1 (Pk-l  <f>k)  [1  - (X  + Pk_i  (f^)-1  j>kt  Pk-l  *kl 
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= pk-l  *k  tx  + V pk-l 
Then  (A4)  can  be  rewritten  as 

Pk  = X_1  [pk-l  - Kk  ^ Pk_i) 

Substituting  (Al),  (A2),  (A4)f  and  (A6)  into  (3.11),  gives 

= l*k  \ ^k1]-1  tx  *k_l  \-i  Yk_i  + ^ yk] 

= X Pk  I*k-1  \-l  Yk-lJ  + pk  *k  Yk 
= [pk-l  ~ Kk  <f’kt  pk-lH*k_i  \-l  Yk_i  ] + Pk  'f’k  Yk 
= pk-l  *k-l  \-l  Yk_i  - Kk  pk_x  $k_1  Yk_i  + 

pk  *k  Yk 

= ©k_i  + Kk  [yk  - <(ikt  (A8) 

Using  equations  (A6),  (A7),  and  (A8),  the  weighted  RLS  algorithm  for 
ARMA  parameter  estimation  is  given  as  follows: 

Prediction  error:  = yk  - q^_1 

Gain  update:  Kk  = Pk_!  ^ [X  + ^ pk_1  ^j-l 

Parameter  update:  \ = 0k_1  + Kk 

Covariance  matrix:  Pk  = X"1  [P^  + Kk  4^  Pk_1  ] 

Input  estimate:  uk  = yk  - ^ 


(A6) 

(A7) 


(A9) 


APPENDIX  B 


DERIVATION  OF  THE  VARIABLE  FORGETTING  FACTOR 


The  derivation  of  the  VFF  is  based  on  [Fortescue,  1977].  From 
(3.9),  the  sum  of  the  weighted  residual  error  can  be  expressed  as 
Ek  = (Yk  - V ek)t  \ (Yk  - ey 

= V \ Yk  - Ykl  \ V % - ©k  *k  \ Yk  + 

®k  *k  \ ^ ©k  (Bl) 

Using  (3.11) 

*k  \ Yk  = %T  \ % % (B2) 

Yk1  \ V = ®k  % $Kt  (B3) 

Then 

Ek  = Yk1  Ak  Yk  - ^ ^ Aj^  ^ 

= Yk2  - * Yfc.i1  Yk_!  - ^ Pk-1  0k  (B4) 

where  Pk--*-  = $k  \ Sk* 

Let 

AEk  = Zr  “ x Ek-1  (B5) 

= Yk2  - % pk_1  K + X ©k-1  Pk-1-1  \-l  (B6) 

From  (A9) 

%.  = %t-l  + pk  *k  ^k  (B7) 

Then 

®kpk_1®k 

= [%t_i  + pk  4>k  pk-1  I%t-1  + pk  *k  *kJ 
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= ^c-l*  Pk1  3c- 1 + 2 3c  ♦k1  %_i  + 3c  '♦k1  pk  4k  (B8 ) 

where 

^k-l1  pk_1  3t-l 

= ^c-lt  t ^ pk-l  1 + 4k  <hct]  0k-l 

= X Qk.i1  Pk-l_1  3c_l  + ^c-l1  4k  4k 1 3c-l  (B9) 

Using  (B8)  and  (B9),  (B6)  is  written  as 

AEk  = yk2  “ 3c-lt  'he  <*’kt  3c-l  ~ ^k2  <f’kt  pk  '♦’k  ~ 

2 3c  ,hct  3c-l  (BIO) 

Using  the  prediction  error  in  (A9),  we  have 
3c  = Yk  - <*Tct  ©k-1 

^t2  = yk2  - 2 3c  3c-l  - ^c-]/  4k  4kt  9k_i  (Bll ) 

Substituting  (Bll)  into  (BIO),  gives 

AJk  = ^k2  (!  - 4kt  pk  4k)  (B12) 

Substituting  (B12)  into  (B5),  the  recursive  equation  for  the  sum  of  the 
weighted  squares  of  the  residual  errors  at  time  instant  k can  be 
expressed  as 

Ek  = x zk-l  + (1  - 4k*-  pk  4k)  3c2  (B13) 

= X Lk_!  + (1  - 4kl  Kk)  ^2 


(B14) 


APPENDIX  C 


DERIVATION  OF  THE  PARAMETER  UPDATING  EQUATION  FOR  WHITE  OR  PULSE  INPUT 

From  the  matrix  representation  of  (3.22),  we  can  express  the 
following  equations  recursively: 

\ Ukp  = X #lc_1  U^P  + 4^  ukP  (Cl) 

\ Yk  = x *k-l  \-l  Yk-1  + *k  Yk  (C2) 

Using  the  same  procedure  as  in  Appendix  A,  substituting  (Cl),  (C2), 

(A2),  (A4)  and  (A6)  into  (3.24),  we  obtain 

= X pk  I*k-1  \-l  Yk_i]  + Pk  +k  Yk  - X pk  I«k-1  \-l  Uk-lPJ 
- Pk  *k  °k 

= [pk-l  ~ Kk  <hct  pk-lH$k_i  Yk_! ] + pk  ‘f’k  Yk  ~ 

fpk-l  ~ Kk  <hct  pk-l  J t $k_i  Ac_i  Uk-1P ] - pk  ^ ukp 
= pk-l  *k-l  \-l  Yk-1  “ pk-l  *k-l  \-l  Uk_iP  - 
Kk  h1  pk-l  *k-l  \-l  Yk-1  + Kk  <f)kt  *k_l  \_i  Uk_iP  + 
pk  *k  yk  - pk  *k  °kP 

= ®k-l  ~ Kk  [Yk  - V 3c_l  - ukP]  (C3) 
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APPENDIX  D 


COMPLEXITY  ANALYSIS  OF  THE  WRLS-VFF  ALGORITHM 


Time  updating  equations  for 
the  WRLS-VFF  algorithm 

Adaptive  process: 

^ = yk  - ©k-i 

Kk  = pk-l  t \-i  + V pk-l  ‘f’kl-1 

Pk  = v1  tpk-l  + Kk  Pit-!  ] 

*k  = 1 - (1  - ^ Kk)  / zi0 

Input  estimate: 

a)  Pulse  input 

If  < Xq  then 
ukv  = 0 
Uk  = UkP 

= yk  - *k  ©k-i 

b)  White  noise  input 


Number  of  flops 

(multiplications  and  additions) 
n + 1 

2n2  + n +1 
n2 

n2  + 2n 


n + 1 


If  Xk  > Xq  then 

Up  = 0 

uk  = ukv 

= ^k  _ <f’kt  Kk)P22  square  root 

Parameters  update: 

9k  = ^t_i  + Kk  (yk  - V ^_!  - ukP)  n2  + n + 1 

Total  number  of  flops  required  (approximately):  5n2  + 6n 

Note:  n=p+q , where  p and  q are  the  order  of  an  ARMA  (p,q)  process. 
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APPENDIX  E 


COMPLEXITY  ANALYSIS  OF  THE  LS  MODIFIED  YULE-WALKER 
(LSMYV)  EQUATION  METHOD  FOR  ARMA  PARAMETER  ESTIMATION 


(1)  Block  diagram  of  the  LSMYW  method: 


SPEECH 


(2)  Complexity  analysis: 

Parameter  estimation  process 

AR  parameter  estimation  (aj) 
(Over-determined  YWE  method  with 
Choleskey  decomposition) 

LPC  inverse  filtering  (en) 

(FIR  filtering) 

High  order  LPC  analysis 
(Autocorrelation  method) 

Q-th  order  LPC  analysis  (b^) 
(Autocorrelation  method) 

Total  number  of  flops  required 
approximately  (N>>  p and  N>>q): 


Number  of  flops 
( 1/6 ) p3  + (N/2)p2 

(N-p)p  + p(p-l)/2 
(N/5)2  + (N/5)p 

q2  +(N/5)q 
(l/6)p3  + Np2/2  + Np 


Note:  p and  q are  the  order  of  the  AR  and  MA  process,  and  N is  the 
total  data  points  for  analysis. 
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APPENDIX  F 


EGG 


(a)  A system  configuration  for  the  electroglot tography  (EGG)  and 

(b)  the  EGG  waveform  output. 
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